[00:00:05] twentyafterfour: I, the Bot under the Fountain, call upon thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T0000).
[00:01:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:03:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:03:15] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:14:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:18:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:23:01] SRE, serviceops, Patch-For-Review: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) Hourly appserver tests are running on both cumin1001 (to mw1414) and cumin2001 (to mw2271). Weirdly, the tests time out in eqiad about half the time: ` Aug 25 20:53:28 cumin1001 systemd[...
[00:38:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:40:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:48:37] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:50:25] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:51:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:56:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:00:18] SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (Krinkle) +1 for adding the `CACHE_VERSION` to the file path. That way, after parser cache rollsover during 30 days, all previous fi...
[01:03:21] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:05:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:09:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:11:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:16:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:19:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:34:23] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1083.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:44:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:46:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:58:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:59:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:06:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:15:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:18:01] (CR) Huji: [C: +1] Install Extension Quiz on fa.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/714872 (https://phabricator.wikimedia.org/T289381) (owner: 4nn1l2)
[02:25:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:28:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:37:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:39:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:48:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:50:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:14:16] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:22:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service,monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:32:32] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:54:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:56:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:57:06] (CR) KartikMistry: [C: +1] Rename wgTranslateBlacklist to wgTranslateDisabledTargetLanguages [mediawiki-config] - https://gerrit.wikimedia.org/r/714770 (owner: Nikerabbit)
[05:00:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:01] (CR) Marostegui: bernard: Add simple documentation into README.md (6 comments) [software/bernard] - https://gerrit.wikimedia.org/r/714870 (https://phabricator.wikimedia.org/T289735) (owner: H.krishna123)
[06:09:25] (PS1) Marostegui: install_server: Reimage db1138 with Buster. [puppet] - https://gerrit.wikimedia.org/r/714882 (https://phabricator.wikimedia.org/T288803)
[06:11:19] (CR) Marostegui: [C: +2] install_server: Reimage db1138 with Buster. [puppet] - https://gerrit.wikimedia.org/r/714882 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[06:19:21] Puppet, SRE, Cloud-Services, Infrastructure-Foundations, cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (elukey) @jbond @Dzahn I got bitten by this problem in production 2/3 times as well (today with an-launcher1...
[06:27:10] ops-eqiad, cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (elukey)
[06:27:28] ACKNOWLEDGEMENT - Host cloudcephosd1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Luca Toscano T289755
[06:33:49] !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[06:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:10] !log Reimage s4 eqiad master (db1138), expect lag on eqiad T288803
[06:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:15] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:43:40] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1160 T288273', diff saved to https://phabricator.wikimedia.org/P17085 and previous config saved to /var/cache/conftool/dbconfig/20210826-064655-marostegui.json
[06:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:00] T288273: Please optimize image table in commonswiki - https://phabricator.wikimedia.org/T288273
[06:48:56] !log more weight to ms-be20[62-65] - T288458
[06:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:00] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[06:49:21] (PS1) Marostegui: db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/714961 (https://phabricator.wikimedia.org/T288803)
[06:49:57] (CR) Marostegui: [C: +2] db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/714961 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[06:55:07] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:57:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1138.eqiad.wmnet with reason: REIMAGE
[06:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:35] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=labsnfs file=node_directory_size_bytes.prom instance=labstore1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[06:59:32] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1138.eqiad.wmnet with reason: REIMAGE
[06:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:26] (PS5) David Caro: nova_fullstack: Add last error output when timing out puppet check [puppet] - https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663)
[07:02:34] (PS4) David Caro: nova_fullstack: try to get the puppet state from a couple places [puppet] - https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663)
[07:04:47] (CR) Elukey: [C: +1] decorators: migrate to the wmflib version [software/spicerack] - https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: Volans)
[07:11:09] (CR) Marostegui: [C: +2] dbbackups: Move s4 backup generation from db1130 to db1150/dbprov1003 [puppet] - https://gerrit.wikimedia.org/r/714704 (https://phabricator.wikimedia.org/T288803) (owner: Jcrespo)
[07:12:16] (CR) MMandere: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[07:20:28] SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (fgiunchedi) Thank you for reaching out @Krinkle, your understanding is correct: within a container what matters for iteration over...
[07:25:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:34] SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (SimoneThisDot)
[07:32:20] (PS1) JMeybohm: kubernetes::node: Make use of the disk_type face [puppet] - https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345)
[07:34:23] SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (RhinosF1) Logstah only needs WMF or NDA LDAP access not shell access
[07:36:04] (PS2) JMeybohm: kubernetes::node: Make use of the disk_type fact [puppet] - https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345)
[07:37:27] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:39:02] SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (RhinosF1) I also don't see any link from your phab account to your WMF one nor any mention of you on https://www.mediawiki.org/wiki/Readers/Structured_Data so I'm not sure how SRE duty team are s...
[07:44:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:55:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:25] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:04:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:04:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:11:40] (CR) Vgutierrez: [V: +1 C: +2] cache: Support TLS on kafka::statsv [puppet] - https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) (owner: Vgutierrez)
[08:13:42] (PS3) Vgutierrez: cache: Support TLS on kafka::statsv [puppet] - https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038)
[08:13:44] (PS3) Vgutierrez: hieradata: Enable SSL for statsv varnishkafka@cp4032 [puppet] - https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038)
[08:14:11] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Audit usages or the realm variable with a view to drop it - https://phabricator.wikimedia.org/T289661 (dcaro) a:dcaro
[08:14:17] (CR) Vgutierrez: [C: +2] cache: Support TLS on kafka::statsv (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) (owner: Vgutierrez)
[08:16:24] (PS2) Nikerabbit: Rename wgTranslateBlacklist to wgTranslateDisabledTargetLanguages [mediawiki-config] - https://gerrit.wikimedia.org/r/714770
[08:17:48] (CR) Vgutierrez: [C: +2] hieradata: Enable SSL for statsv varnishkafka@cp4032 [puppet] - https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038) (owner: Vgutierrez)
[08:20:11] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Audit usages or the realm variable with a view to drop it - https://phabricator.wikimedia.org/T289661 (dcaro)
[08:21:59] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) And some more input regarding RBAC and the replacement of Tiller service account: Currently two different users exist for a service deployment. One is the less privileged viewer...
[08:25:43] SRE, SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (SimoneThisDot) Hi @RhinosF1 thank you so much for the quick reply. Shall I create another ticket for LDAP or is it possible to just assign the correct tags to this ticket to make sure...
[08:27:48] SRE, SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (RhinosF1) Regarding the new ticket or not, as long as you give the right info, you can update this. If it's easier, just do a new one. I did some digging and realised you are a contract...
[08:29:28] (CR) Filippo Giunchedi: "See inline" [puppet] - https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: Herron)
[08:31:00] (CR) Filippo Giunchedi: prometheus: add recording rules for etcd error slo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: Herron)
[08:35:59] (PS1) Vgutierrez: hieradata: Enable SSL cluster wide for statsv varnishkafka [puppet] - https://gerrit.wikimedia.org/r/714964 (https://phabricator.wikimedia.org/T286038)
[08:37:49] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:38:04] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30842/console" [puppet] - https://gerrit.wikimedia.org/r/714964 (https://phabricator.wikimedia.org/T286038) (owner: Vgutierrez)
[08:41:19] (CR) Vgutierrez: [C: +2] cache: Provide an envoy STEK manager script [puppet] - https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[08:42:58] (PS1) Filippo Giunchedi: wmflib: add 'aka' to Service [puppet] - https://gerrit.wikimedia.org/r/714965
[08:43:00] (PS1) Filippo Giunchedi: hieradata: add 'aka' for a few services [puppet] - https://gerrit.wikimedia.org/r/714966
[08:43:41] SRE, DNS, Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (Vgutierrez) updating the subscribers list to add "our" Brandon :)
[08:44:35] SRE, DNS, Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (jcrespo) ups, sorry, my bad!
[08:46:19] (CR) Filippo Giunchedi: "See also I9a8df370f for usage with existing services (e.g. "swift" vs "ms-fe")" [puppet] - https://gerrit.wikimedia.org/r/714965 (owner: Filippo Giunchedi)
[08:49:20] (PS1) Filippo Giunchedi: pontoon: extend service_names to include 'aka' names [puppet] - https://gerrit.wikimedia.org/r/714968
[08:50:18] (CR) Cathal Mooney: [C: +1] "LGTM. Code makes sense to my untrained eyes. Explanation / logic is fine and makes perfect sense. May need review if we migrate to a ro" [puppet] - https://gerrit.wikimedia.org/r/714767 (https://phabricator.wikimedia.org/T289679) (owner: Jbond)
[08:50:47] (PS3) JMeybohm: kubernetes::node: Make use of the disk_type fact [puppet] - https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345)
[08:50:49] (PS1) JMeybohm: wmflib: Simplify the structure of disk_type fact [puppet] - https://gerrit.wikimedia.org/r/714969 (https://phabricator.wikimedia.org/T288509)
[08:52:03] !log restart varnishkafka-statsv on cp4032
[08:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:07] (CR) jerkins-bot: [V: -1] wmflib: Simplify the structure of disk_type fact [puppet] - https://gerrit.wikimedia.org/r/714969 (https://phabricator.wikimedia.org/T288509) (owner: JMeybohm)
[08:53:56] (PS2) JMeybohm: wmflib: Simplify the structure of disk_type fact [puppet] - https://gerrit.wikimedia.org/r/714969 (https://phabricator.wikimedia.org/T288509)
[08:53:59] (PS4) JMeybohm: kubernetes::node: Make use of the disk_type fact [puppet] - https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345)
[09:01:54] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) Thanks for writing this up. As discussed already I'm in for option 2 as well as it keeps things "mostly" as they are. As you said someone with access to tiller (e.g. every depl...
[09:10:33] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1001.eqiad.wmnet
[09:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:43] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1001.eqiad.wmnet
[09:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:01] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1002.eqiad.wmnet
[09:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:10] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1002.eqiad.wmnet
[09:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:32] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1003.eqiad.wmnet
[09:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:54] elukey@kafka-main1001:~$ kafka acls --add --allow-principal User:CN=varnishkafka --producer --topic statsv - T286038
[09:15:55] T286038: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038
[09:17:43] !log restart varnishkafka-statsv on cp4032 to pick up TLS settings
[09:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:14] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1003.eqiad.wmnet
[09:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:15] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl1001.eqiad.wmnet
[09:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:39] !log elukey@kafka-main1001:~$ kafka acls --add --allow-principal User:CN=varnishkafka --producer --topic statsv - T286038
[09:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:43] T286038: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038
[09:21:49] forgot to log of course :P
[09:22:44] beers--
[09:23:12] Nobody I know likes reading logs, therefore we must produce any!
[09:23:13] I mean I thought I logged, but without the !log
[09:23:19] +not
[09:24:33] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet
[09:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:25:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:27:37] SRE, DNS, Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (Brandon) >>! In T289618#7311261, @Vgutierrez wrote: > updating the subscribers list to add "our" Brandon :) I feel so unwanted 🥲
[09:30:47] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet
[09:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:34] (PS3) Kosta Harlan: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797)
[09:31:36] (PS2) Kosta Harlan: [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - https://gerrit.wikimedia.org/r/714549
[09:36:25] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet
[09:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:31] SRE, MW-on-K8s, Release Pipeline, Release-Engineering-Team, Kubernetes: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (JMeybohm)
[09:38:38] SRE, serviceops, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (jijiki)
[09:40:08] SRE-swift-storage, MediaWiki-extensions-Score, I18n: Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (fgiunchedi) #mediawiki-extensions-score seems correct to me, the content-type should be set when mw writes the file...
[09:46:37] SRE, Platform Engineering, serviceops, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ArielGlenn) Platform Engineering will take this, but if there are complications, we'll be back...
[09:51:18] SRE, Platform Engineering, serviceops, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (jcrespo) Thank you @ArielGlenn
[09:51:19] (PS1) Jbond: P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975
[09:51:49] (CR) jerkins-bot: [V: -1] P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (owner: Jbond)
[09:58:00] (PS1) Lucas Werkmeister (WMDE): Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/714853 (https://phabricator.wikimedia.org/T288846)
[09:58:12] (PS1) Lucas Werkmeister (WMDE): Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714854 (https://phabricator.wikimedia.org/T288846)
[10:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T1000).
[10:01:49] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet
[10:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:32] (PS1) Vgutierrez: Add learn.wiki DNS zone [dns] - https://gerrit.wikimedia.org/r/714976 (https://phabricator.wikimedia.org/T289618)
[10:06:52] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet
[10:06:54] (PS1) Elukey: istio: update config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/714977
[10:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:31] !log disable puppet on cp-text to merge I52cf2a573980e33487d1f05f19b192ae7d13d717 - T286038
[10:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:35] T286038: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038
[10:07:55] (PS2) Elukey: istio: update config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/714977
[10:07:58] (CR) Vgutierrez: [V: +1 C: +2] hieradata: Enable SSL cluster wide for statsv varnishkafka [puppet] - https://gerrit.wikimedia.org/r/714964 (https://phabricator.wikimedia.org/T286038) (owner: Vgutierrez)
[10:09:15] (CR) Klausman: [C: +1] istio: update config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/714977 (owner: Elukey)
[10:09:21] !log rolling restart of varnishkafka-statsv - T289618
[10:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:25] T289618: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618
[10:09:36] arg... damn clipboard, wrong task
[10:09:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:17] (CR) Physikerwelt: [C: +1] Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/714853 (https://phabricator.wikimedia.org/T288846) (owner: Lucas Werkmeister (WMDE))
[10:11:32] (CR) Physikerwelt: [C: +1] Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714854 (https://phabricator.wikimedia.org/T288846) (owner: Lucas Werkmeister (WMDE))
[10:21:59] (CR) Vgutierrez: [C: +2] Add learn.wiki DNS zone [dns] - https://gerrit.wikimedia.org/r/714976 (https://phabricator.wikimedia.org/T289618) (owner: Vgutierrez)
[10:28:06] SRE, DNS, Traffic, Patch-For-Review: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (Vgutierrez) Open→Resolved a:Vgutierrez `vgutierrez@carrot:~/wikimedia.org/operations/dns/templates$ host dev.learn.wiki. ns0.wikimedia.org. dev.learn.wiki has addres...
[10:28:27] (PS1) Jbond: debdeploy: move debdeploy to its own class [puppet] - https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661)
[10:29:01] (PS2) Jbond: debdeploy: move debdeploy to its own class [puppet] - https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661)
[10:31:35] SRE, Infrastructure-Foundations, Datacenter-Switchover, User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (Vgutierrez) Thanks for pointing it out. It's been already fixed. Thanks @elukey for his support :) >>! In T286038#73085...
[10:40:20] (03PS8) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [10:40:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:43:52] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30844/console" [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:44:14] have we rolled to group1? [10:47:20] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) @Jmando One last thing I need from you is- some contractors/researchers/grantees have an end of contact date- If you have one (can be extended later) I will need it to note it on the a... [10:48:57] sDrewth: doesn’t look like it so I guess it’s not US morning yet https://phabricator.wikimedia.org/T281161#7310668 [10:49:11] k, thanks [10:49:31] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) a:05CDanis→03cmooney [10:49:33] (03CR) 10Vgutierrez: [V: 03+1] "PCC shows the expected DIFF at puppet level and a NOOP on resulting envoy configuration" [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:49:33] trying to find the page where it is listed and cannot [10:49:45] https://versions.toolforge.org/ ? 
[10:50:24] (03PS3) 10Jbond: debdeploy: move debdeploy to its own class [puppet] - 10https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661) [10:50:52] and that would be why I couldn't find it on the cloud servers [10:51:57] 10SRE, 10SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (10jcrespo) The easiest way is probably closing this ticket here as invalid and creating a new one following [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?title=Grant%20A... [10:53:37] (03PS4) 10Jbond: debdeploy: move debdeploy to its own class [puppet] - 10https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661) [10:57:37] 10SRE, 10Maps (Tilerator): Externalize tile storage for maps - https://phabricator.wikimedia.org/T196474 (10Jgiannelos) Tegola staging is already using swift for some time already as a tile storage backend and metrics look OK. [10:58:14] 10SRE, 10Maps (Tilerator): Externalize tile storage for maps - https://phabricator.wikimedia.org/T196474 (10Jgiannelos) 05Open→03Resolved a:03Jgiannelos [10:59:05] (03PS5) 10Jbond: debdeploy: move debdeploy to its own class [puppet] - 10https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661) [11:00:04] Amir1, Lucas_WMDE, and apergos: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for EU Backport and Config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T1100). [11:00:05] Nikerabbit and Lucas_WMDE: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:25] I'm here.
I see you have two patches Lucas_WMDE and I assume you will self serve [11:00:34] \o [11:00:41] I can't tell if Nikerabbit's patch might mean a rebuild of the language cache or not [11:00:44] 10SRE, 10SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (10dr0ptp4kt) Thanks all. I will approve of the logstash access on a new ticket to be filed by Simone. I gather if Simone specifies that it's Logstash, then no LDAP group needs to be specif... [11:00:52] no one has signed up for training [11:00:54] apergos: no it does not need one [11:01:04] Nikerabbit: are you comfortable doing your own deploy? [11:01:05] apergos: yes, though if you happen to be familiar with Kubernetes / Mathoid deployments, I’d actually appreciate some training myself ^^ T289674 [11:01:06] T289674: Deploy new Mathoid version to production - https://phabricator.wikimedia.org/T289674 [11:01:37] apergos: I think so, using the new deploy commands instructions :) [11:02:35] ok! just make sure you read them before copy-paste :-) [11:02:52] Lucas_WMDE: I wish I were.... 
but sadly no [11:02:59] ok, then I’ll just do the Math backports [11:03:05] Nikerabbit: feel free to start [11:03:18] Lucas_WMDE: thanks, I'm starting [11:03:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30847/console" [puppet] - 10https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:03:43] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:04:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] debdeploy: move debdeploy to its own class [puppet] - 10https://gerrit.wikimedia.org/r/714980 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:05:28] (03CR) 10Nikerabbit: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714770 (owner: 10Nikerabbit) [11:05:29] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:06:21] (03Merged) 10jenkins-bot: Rename wgTranslateBlacklist to wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714770 (owner: 10Nikerabbit) [11:06:53] o_O jenkins-bot can auto-rebase config changes? [11:07:10] I didn’t know that [11:07:15] is that unusual? [11:07:26] well usually I manually rebase just before +2ing [11:07:29] I mean, there is per repo setting whether automatic rebase can be done [11:07:33] and AFAIK other deployers do it as well [11:07:40] now I’m wondering if there’s actually any reason to do that [11:07:46] or if we could’ve just +2ed directly the entire time :D [11:07:50] :-D [11:08:05] if you manually do it, there will be no surprises I guess [11:08:22] I guess so [11:08:37] testing on mwdebug [11:09:19] 10SRE, 10SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (10Peachey88) >>! In T289757#7311537, @dr0ptp4kt wrote: > Thanks all. 
I will approve of the logstash access on a new ticket to be filed by Simone. I gather if Simone specifies that it's Log... [11:10:06] (03PS2) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 [11:10:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714854 (https://phabricator.wikimedia.org/T288846) (owner: 10Lucas Werkmeister (WMDE)) [11:10:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714853 (https://phabricator.wikimedia.org/T288846) (owner: 10Lucas Werkmeister (WMDE)) [11:10:33] I +2ed my backports, they’ll take a while to go through CI anyways [11:10:37] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:10:40] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10Dzahn) There are ~ 5 ways to achieve this: a) Only with Icinga configuration- the strict way - We need to have a contact... [11:10:59] is mwdebug1002.eqiad.wmnet supposed to work? [11:11:51] I get [3e65acc4-5878-4502-9349-87a84d7097b5] 2021-08-26 11:11:23: Fatal exception of type "Wikimedia\RequestTimeout\RequestTimeoutException" [11:12:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:06] eqiad is the read-only datacenter right now :) [11:12:13] use one of the codfw mwdebugs instead [11:12:15] 2001 or 2002 [11:12:26] oh, so the command helper is wrong and not alerting about this? :( [11:12:34] oh, it tells you to use 1002? 
[11:12:36] that’s… not good :/ [11:13:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:37] I’ll open a phab task [11:13:52] is deploy1002.eqiad.wmnet wrong too? [11:14:26] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10Dzahn) a:03Dzahn No problem, I can do this (soon). [11:14:55] no, that’s still correct I think [11:15:09] though I would prefer if it told you the alias “deployment.eqiad.wmnet” instead [11:16:00] I think if you try to SSH into another deploy server it’ll yell at you in large letters “no don’t use this one” in the login banner ^^ [11:16:02] yes deploy1002 is the right deployment host [11:16:09] (or at least it does that on the maintenance script servers) [11:16:15] and yes there is a warning banner to scare you off the wrong ones [11:16:54] yeah, no such warning on debug servers [11:17:56] I filed https://phabricator.wikimedia.org/T289772 [11:18:05] thanks [11:18:13] (03PS3) 10David Caro: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:18:16] (03PS1) 10David Caro: global: use p:b:production the main entry point [puppet] - 10https://gerrit.wikimedia.org/r/714983 [11:18:57] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:20:10] !log nikerabbit@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:714770|Rename wgTranslateBlacklist to wgTranslateDisabledTargetLanguages]] (duration: 01m 05s) [11:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:14] Lucas_WMDE: all yours [11:21:19] cool, thanks! 
[11:21:25] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum1001.eqiad.wmnet [11:21:25] still waiting for CI right now :) [11:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10Dzahn) ` Ready to create Ganeti VM durum1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row D with 2 vCPUs, 8GB of RAM, 15GB of disk in the private network. ` [11:23:53] 10SRE, 10SRE-Access-Requests: Requesting access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289757 (10jcrespo) 05Open→03Invalid I am going to close the ticket- while we are very grateful for all the people wanting to contribute and help here, I'm afraid it can be confusing for someon... [11:24:04] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/714987 (owner: 10L10n-bot) [11:24:10] (03CR) 10Jbond: global: use p:b:production the main entry point (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714983 (owner: 10David Caro) [11:24:36] 10SRE-swift-storage, 10MediaWiki-extensions-Score, 10I18n: Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (10TheDJ) I guess this is a FileBackend detail.. I see a streamMimeFunc here: https://github.com/wikimedia/mediawiki... 
[11:26:03] (03PS1) 10Jbond: P:debdeploy::client: Add debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) [11:26:33] (03CR) 10jerkins-bot: [V: 04-1] P:debdeploy::client: Add debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:28:46] (03PS2) 10Jbond: P:debdeploy::client: Add debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) [11:28:53] still waiting for the merges I guess Lucas_WMDE? [11:29:03] yup, though probably not much longer [11:29:07] ok [11:29:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30849/console" [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:30:02] I'll drop Tyler a message that having trainings around mathoid/other oiddeployments on kube might be nice [11:30:52] cool, thanks [11:31:20] I guess that’s different from new deployments on kubernetes? the k8s-experimental thing? [11:31:48] I don't know at all! [11:31:55] I've never done one of any of them [11:32:08] ok [11:32:26] (03Merged) 10jenkins-bot: Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714854 (https://phabricator.wikimedia.org/T288846) (owner: 10Lucas Werkmeister (WMDE)) [11:32:29] ayy [11:32:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum1001.eqiad.wmnet [11:32:32] (03Merged) 10jenkins-bot: Allow rendering of 0 [extensions/Math] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714853 (https://phabricator.wikimedia.org/T288846) (owner: 10Lucas Werkmeister (WMDE)) [11:32:33] wooo! 
[11:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:39] let’s start with wmf.20 [11:32:44] (03PS3) 10Jbond: P:debdeploy::client: Add debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) [11:33:16] testing on mwdebug2001… [11:33:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30850/console" [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:33:40] yay, looking good on https://test.wikipedia.org/wiki/User:Lucas_Werkmeister_(WMDE)/sandbox [11:34:22] excellent! [11:34:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30851/console" [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:34:57] eh, I guess my log message with might actually render in the SAL page on Wikitech ^^ [11:35:00] let’s see [11:35:01] (03PS1) 10Dzahn: DHCP: add MAC address for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/714996 (https://phabricator.wikimedia.org/T289693) [11:35:25] (03PS1) 10DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) [11:35:35] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add MAC address for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/714996 (https://phabricator.wikimedia.org/T289693) (owner: 10Dzahn) [11:35:43] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC address for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/714996 (https://phabricator.wikimedia.org/T289693) (owner: 10Dzahn) [11:35:48] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/Math/src/HookHandlers/ParserHooksHandler.php: Backport: [[gerrit:714854|Allow rendering of 
0 (T288846)]] (duration: 01m 05s) [11:35:51] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) (owner: 10DCausse) [11:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:53] T288846: TypeError: Argument 1 passed to HookHandlers\ParserHooksHandler::mathTagHook() must be of the type string, null given - https://phabricator.wikimedia.org/T288846 [11:36:18] Backport: Allow rendering of (T288846) [11:36:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30852/console" [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:36:25] that's what shows in the log on wikitech :-P [11:36:33] I’ll nowiki it :D [11:36:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debdeploy::client: Add debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/714991 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:36:43] (well that's the relevant bit, I dropped the first part) [11:36:43] but I thought having in the sal.toolforge.org version would look weirder [11:36:49] (03PS2) 10Dzahn: DHCP: add MAC address for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/714996 (https://phabricator.wikimedia.org/T289693) [11:36:51] anyways, testing wmf.19 first [11:37:25] 👍 [11:37:40] (03PS1) 10Ssingh: durum: add role insetup [puppet] - 10https://gerrit.wikimedia.org/r/714998 (https://phabricator.wikimedia.org/T289536) [11:37:42] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC address for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/714996 (https://phabricator.wikimedia.org/T289693) (owner: 10Dzahn) [11:38:15] saved by the typos CI check - thanks jerkins [11:38:16] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net.
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/708091 (owner: 10L10n-bot) [11:38:27] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/708747 (owner: 10L10n-bot) [11:38:32] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/706390 (owner: 10L10n-bot) [11:38:37] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/709403 (owner: 10L10n-bot) [11:38:43] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/710238 (owner: 10L10n-bot) [11:38:48] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/710954 (owner: 10L10n-bot) [11:38:53] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/712227 (owner: 10L10n-bot) [11:38:57] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/713244 (owner: 10L10n-bot) [11:39:03] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/713837 (owner: 10L10n-bot) [11:39:18] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/Math/src/HookHandlers/ParserHooksHandler.php: Backport: [[gerrit:714853|Allow rendering of 0 (T288846)]] (duration: 01m 04s) [11:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:21] (03CR) 10Dzahn: [C: 03+2] durum: add role insetup [puppet] - 10https://gerrit.wikimedia.org/r/714998 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [11:39:27] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/714343 (owner: 10L10n-bot) [11:39:50] apergos: now that wikitech has received the backport, it renders as “rendering of 0” ^^ [11:39:55] (03PS2) 10DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) [11:39:57] (03PS4) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 [11:40:08] very nice! [11:40:26] (03CR) 10Nikerabbit: "This is a bit embarrassing, but we failed to notice l10n-bot-watcher is missing V+2 that is prerequisite for submitting. Can you add it to" [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) (owner: 10Hashar) [11:40:29] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:40:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:49] !log EU backport+config window done [11:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:59] no one else snuck in a patch so that's it for the window [11:41:29] we identified a bug in the deploy-commands tool, that seems like a productive training session to me [11:42:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:50] (03CR) 10Nikerabbit: "FYI: You may want to try going through +2 route while we try to get the permissions for l10n-bot-watcher resolved." [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/714987 (owner: 10L10n-bot) [11:44:10] true enough [11:45:28] (03PS3) 10DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) [11:46:40] (03PS1) 10Dzahn: install_server: add durum to partman, standard VM recipe [puppet] - 10https://gerrit.wikimedia.org/r/715001 (https://phabricator.wikimedia.org/T289536) [11:47:37] (03CR) 10Dzahn: [C: 03+2] install_server: add durum to partman, standard VM recipe [puppet] - 10https://gerrit.wikimedia.org/r/715001 (https://phabricator.wikimedia.org/T289536) (owner: 10Dzahn) [11:48:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:22] (03PS5) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 [11:49:57] (03CR) 10Jbond: "i made some changes and rebased, left some comments inline explaining" [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:50:08] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:50:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:45] (03PS6) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 [11:53:16] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (owner: 10Jbond) [11:57:18] (03PS4) 10DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) [11:58:30] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10jcrespo) #SRE_Observability may want to weight in, as I know they have been working on similar request with alertmanager,... 
[12:04:42] (03PS5) 10DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) [12:10:17] (03CR) 10Elukey: [C: 03+2] istio: update config for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/714977 (owner: 10Elukey) [12:14:55] (03PS1) 10Jbond: P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) [12:16:27] (03CR) 10jerkins-bot: [V: 04-1] P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:18:31] (03PS2) 10Jbond: P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) [12:20:02] (03CR) 10jerkins-bot: [V: 04-1] P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:21:21] !log running puppet initial run on durum1001.eqiad.wmnet - T289536 [12:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:25] T289536: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 [12:21:37] (03PS3) 10Jbond: P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) [12:22:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30856/console" [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:23:07] (03CR) 10jerkins-bot: [V: 04-1] P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:25:17] (03PS4) 10Jbond: 
P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) [12:25:25] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) @RobH I've updated that Wiki page now with instructions on how to create the USB disk image and begin the install. https://wikitech.w... [12:26:59] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) [12:32:04] (03PS7) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [12:32:42] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:33:15] (03PS1) 10Elukey: knative-serving: add missing vars for replicaCount [deployment-charts] - 10https://gerrit.wikimedia.org/r/715006 (https://phabricator.wikimedia.org/T278194) [12:33:51] (03PS1) 10Ssingh: Add durum1001 to BGP anycast in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/715007 (https://phabricator.wikimedia.org/T289536) [12:34:17] (03PS2) 10Elukey: knative-serving: add missing vars for replicaCount [deployment-charts] - 10https://gerrit.wikimedia.org/r/715006 (https://phabricator.wikimedia.org/T278194) [12:35:37] (03PS8) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [12:36:20] (03PS3) 10Elukey: knative-serving: add missing vars for replicaCount [deployment-charts] - 10https://gerrit.wikimedia.org/r/715006 (https://phabricator.wikimedia.org/T278194) [12:36:25] (03CR) 10Jbond: "PCC (still running): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30857" [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:40:43] (03PS7) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [12:41:52] (03CR) 10Klausman: [C: 03+1] knative-serving: add missing vars for replicaCount [deployment-charts] - 10https://gerrit.wikimedia.org/r/715006 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [12:43:50] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (10jcrespo) @AntiCompositeNumber I tried purging it with no success- my guess would be that it is not on cache, bu... [12:53:43] (03PS1) 10Jbond: P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) [12:54:05] (03CR) 10Elukey: [C: 03+2] knative-serving: add missing vars for replicaCount [deployment-charts] - 10https://gerrit.wikimedia.org/r/715006 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [12:54:15] (03CR) 10jerkins-bot: [V: 04-1] P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:55:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30858/console" [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:56:25] (03PS2) 10Jbond: P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) [12:56:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:56] (03CR) 10jerkins-bot: [V: 04-1] P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:57:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:57:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30859/console" [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:16] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714799 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [12:57:42] (03CR) 10Jbond: [C: 03+1] wmcs: remove wmcs/ subtree [cookbooks] - 10https://gerrit.wikimedia.org/r/714798 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [12:58:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [12:59:52] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1003.eqiad.wmnet [12:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:04:39] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1003.eqiad.wmnet [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) 
[13:09:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714969 (https://phabricator.wikimedia.org/T288509) (owner: 10JMeybohm) [13:10:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:12:54] (03CR) 10JMeybohm: [C: 03+2] wmflib: Simplify the structure of disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714969 (https://phabricator.wikimedia.org/T288509) (owner: 10JMeybohm) [13:15:09] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1004.eqiad.wmnet [13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:40] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1004.eqiad.wmnet [13:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] 10SRE, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 (10Gehel) p:05Medium→03High [13:23:10] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) There is a ClusterRole named `deploy` already for the aggregation of `view` and `pods/portForward` permissions. 
So I would prefer using the names `` and ` (03PS5) 10Andrew Bogott: nova_fullstack: try to get the puppet state from a couple places [puppet] - 10https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [13:25:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:28:20] (03CR) 10Jbond: [V: 03+1] "PCC still running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30860/" [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:29:43] (03PS3) 10Jbond: P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) [13:30:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:30:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:32:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:32:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:42] (03PS1) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) [13:36:07] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "To be deployed on 6 September, not before (see announcement: https://lists.wikimedia.org/hyperkitty/list/wikidata-tech@lists.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: 10Lucas Werkmeister (WMDE)) [13:43:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:47:39] (03PS1) 10JMeybohm: rsyslog/kubernetes: Add explicit name to mmkubernetes action [puppet] - 10https://gerrit.wikimedia.org/r/715024 (https://phabricator.wikimedia.org/T289766) [13:49:41] (03PS1) 10Volans: quotereviewer: add support for portal quotes [software] - 10https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) [13:51:42] (03PS1) 10Andrew Bogott: Added cloud-wide default for profile::debdeploy::client::filter_services: [puppet] - 
10https://gerrit.wikimedia.org/r/715026 (https://phabricator.wikimedia.org/T289663) [13:55:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (10jcrespo) a:03jbond I am assigning this to you to reflect the fact that you seem to have created a fix or workaround for it- feel free... [13:57:21] (03CR) 10Jbond: [C: 03+1] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/715026 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [13:57:47] (03CR) 10Andrew Bogott: [C: 03+2] Added cloud-wide default for profile::debdeploy::client::filter_services: [puppet] - 10https://gerrit.wikimedia.org/r/715026 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [13:58:10] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10jcrespo) p:05Triage→03High [13:58:12] (03CR) 10David Caro: [C: 03+1] Added cloud-wide default for profile::debdeploy::client::filter_services: [puppet] - 10https://gerrit.wikimedia.org/r/715026 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [13:58:17] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10jcrespo) p:05Triage→03High [13:58:31] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10jcrespo) p:05Triage→03High [13:59:42] (03PS1) 10Jgiannelos: Configure event stream for map tile expiration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [14:00:46] (03PS36) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 
(https://phabricator.wikimedia.org/T266641) [14:06:19] 10SRE: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10jcrespo) p:05Triage→03High If you don't mind me triaging this ticket/moving it in the column, so clinic duty can identify more easily other #SRE unattended tasks 0:-) Onboarding a workmate looks to me like an important thing to... [14:07:21] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:09:40] (03CR) 10Jbond: [C: 03+2] P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:09:49] (03PS4) 10Jbond: P:base: Create profile::apt [puppet] - 10https://gerrit.wikimedia.org/r/715010 (https://phabricator.wikimedia.org/T289661) [14:11:17] (03PS1) 10Andrew Bogott: Revert "nova_fullstack: try to get the puppet state from a couple places" [puppet] - 10https://gerrit.wikimedia.org/r/714858 [14:11:51] (03CR) 10David Caro: "This will require changes on a couple project's puppet config (through horizon Ui):" [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:11:58] (03CR) 10jerkins-bot: [V: 04-1] Revert "nova_fullstack: try to get the puppet state from a couple places" [puppet] - 10https://gerrit.wikimedia.org/r/714858 (owner: 10Andrew Bogott) [14:21:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:22:07] (03PS1) 10Ssingh: acme_chief: authorize durum1001 host for durum [puppet] - 10https://gerrit.wikimedia.org/r/715029 (https://phabricator.wikimedia.org/T289536) [14:22:17] (03CR) 10Filippo Giunchedi: [C: 
03+1] rsyslog/kubernetes: Add explicit name to mmkubernetes action [puppet] - 10https://gerrit.wikimedia.org/r/715024 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [14:23:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:23:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30863/console" [puppet] - 10https://gerrit.wikimedia.org/r/715029 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [14:24:14] !log start of mwscript extensions/FlaggedRevs/maintenance/pruneRevData.php --wiki=plwiki --prune --batch-size=10 --sleep=2 (T289249) [14:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:19] T289249: flaggedtemplates table should not keep the whole history of all revisions - https://phabricator.wikimedia.org/T289249 [14:25:41] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) @jcrespo Right now that date is July 19, 2022. [14:27:07] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Trizek-WMF) [14:27:18] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) Thank you! Now only waiting on #Analytics final approval. 
[14:31:25] (03CR) 10Jbond: [C: 03+2] lldp fact: updated lldp parent fact to fall back to routers [puppet] - 10https://gerrit.wikimedia.org/r/714767 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [14:31:58] (03CR) 10Jbond: [C: 03+2] labstore::drdb_role fact: update facter implementation to ignore stderr [puppet] - 10https://gerrit.wikimedia.org/r/714762 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [14:33:45] (03PS1) 10Jbond: lldp: remove debug [puppet] - 10https://gerrit.wikimedia.org/r/715031 [14:34:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] lldp: remove debug [puppet] - 10https://gerrit.wikimedia.org/r/715031 (owner: 10Jbond) [14:34:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:38:15] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:38:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (10jbond) thanks jcrespo this has now been fixed ` lang=console,name=ganeti5001 $ sudo facter -p lldp_parent... 
[14:38:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (10jbond) 05Open→03Resolved [14:38:49] (03PS5) 10Jbond: P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) [14:39:00] (03PS9) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [14:40:53] (03PS1) 10Filippo Giunchedi: o11y: add prometheus alerts [alerts] - 10https://gerrit.wikimedia.org/r/715032 (https://phabricator.wikimedia.org/T288726) [14:42:24] (03PS1) 10Filippo Giunchedi: prometheus: remove alerts moved to AM [puppet] - 10https://gerrit.wikimedia.org/r/715033 (https://phabricator.wikimedia.org/T288726) [14:44:01] (03PS1) 10Ssingh: site: change role of durum1001 from insetup to durum [puppet] - 10https://gerrit.wikimedia.org/r/715034 [14:44:58] (03CR) 10Ssingh: [C: 03+2] site: change role of durum1001 from insetup to durum [puppet] - 10https://gerrit.wikimedia.org/r/715034 (owner: 10Ssingh) [14:46:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:52] (03PS1) 10Ssingh: durum: intial commit [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) [14:56:58] marostegui at https://phabricator.wikimedia.org/T289249#7304088 there is a row missing a wiki, "| 1009754433 | |" - is this intentional? [14:58:29] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:26] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. 
Look forward to trying it out when it's up!" [homer/public] - 10https://gerrit.wikimedia.org/r/715007 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [14:59:26] DannyS712: no, that's what the SQL query returned [14:59:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:59:54] oh - is it possible to figure out which wiki that is for? [15:02:53] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [15:03:19] (03PS1) 10Ssingh: durum: introduce role durum [puppet] - 10https://gerrit.wikimedia.org/r/715041 [15:04:43] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:06:31] DannyS712: maybe, I'm on holidays so maybe next week [15:06:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:07:29] (03PS1) 10Elukey: kubeflow: raise cpu limits for the kfserving controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/715042 (https://phabricator.wikimedia.org/T272919) [15:07:57] there are a lot of errors in the icinga config, all related to parents for ganeti, jbond might be related to your lldp change? [15:08:03] example [15:08:03] Error: 'deneb.codfw.wmnet' is not a valid parent for host 'ganeti2008' (file '/etc/icinga/objects/puppet_hosts.cfg', line 23089)! 
[15:08:40] (03PS2) 10Elukey: kubeflow: raise cpu limits for the kfserving controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/715042 (https://phabricator.wikimedia.org/T272919) [15:09:09] (03CR) 10Ssingh: [C: 03+2] durum: introduce role durum [puppet] - 10https://gerrit.wikimedia.org/r/715041 (owner: 10Ssingh) [15:10:45] volans: does this fix it? in a meeting right now but can pull out if needs be https://gerrit.wikimedia.org/r/c/operations/puppet/+/715031. [15:11:20] (03PS3) 10Elukey: kubeflow: raise cpu limits for the kfserving controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/715042 (https://phabricator.wikimedia.org/T272919) [15:11:30] jbond: not sure, trying it right now I'll let you know [15:11:48] thx [15:12:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:13:29] (03PS2) 10Ssingh: durum: intial commit [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) [15:13:43] (03CR) 10Vgutierrez: varnish: Containerize varnish test environment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [15:16:30] 10SRE, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10Infrastructure-Foundations, 10netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (10dcaro) [15:16:52] (03PS1) 10Volans: mariadb: add section to alert name [puppet] - 10https://gerrit.wikimedia.org/r/715043 [15:17:47] 10SRE, 10Release-Engineering-Team: Puppet failure on deployment-kafka-jumbo-3.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T289782 (10dancy) [15:17:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:18:16] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10WDoranWMF) [15:18:34] (03PS2) 10Andrew Bogott: Revert "nova_fullstack: try to get the puppet state from a couple places" [puppet] - 10https://gerrit.wikimedia.org/r/714858 (https://phabricator.wikimedia.org/T289663) [15:18:37] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10WDoranWMF) p:05Triage→03High [15:19:14] (03CR) 10jerkins-bot: [V: 04-1] Revert "nova_fullstack: try to get the puppet state from a couple places" [puppet] - 10https://gerrit.wikimedia.org/r/714858 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [15:19:59] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 4 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:20:07] (03PS3) 10Andrew Bogott: Revert "nova_fullstack: try to get the puppet state from a couple places" [puppet] - 10https://gerrit.wikimedia.org/r/714858 (https://phabricator.wikimedia.org/T289663) [15:20:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] rsyslog/kubernetes: Add explicit name to mmkubernetes action [puppet] - 10https://gerrit.wikimedia.org/r/715024 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [15:21:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:21:47] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova_fullstack: try to get the 
puppet state from a couple places" [puppet] - 10https://gerrit.wikimedia.org/r/714858 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [15:22:07] (03CR) 10BBlack: "Looking pretty good! Various minor stuff replied/noted inline." [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [15:26:30] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) [15:28:27] (03CR) 10Thcipriani: Review access change (031 comment) [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) (owner: 10Hashar) [15:30:05] (03PS1) 10Andrew Bogott: nova vendordata: try to have cloud-init perform the first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/715045 (https://phabricator.wikimedia.org/T289663) [15:32:37] (03PS1) 10Jbond: Revert "lldp fact: updated lldp parent fact to fall back to routers" [puppet] - 10https://gerrit.wikimedia.org/r/714861 [15:32:54] (03CR) 10jerkins-bot: [V: 04-1] Revert "lldp fact: updated lldp parent fact to fall back to routers" [puppet] - 10https://gerrit.wikimedia.org/r/714861 (owner: 10Jbond) [15:33:25] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [15:33:47] (03CR) 10Jbond: "see also https://phabricator.wikimedia.org/P17087" [puppet] - 10https://gerrit.wikimedia.org/r/714861 (owner: 10Jbond) [15:34:57] (03PS2) 10Jbond: Revert "lldp fact: updated lldp parent fact to fall back to routers" [puppet] - 10https://gerrit.wikimedia.org/r/714861 [15:35:15] RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian 
[15:36:41] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10Papaul) @cmooney USB in place [15:37:17] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714861 (owner: 10Jbond) [15:37:48] (03CR) 10Jbond: [C: 03+2] Revert "lldp fact: updated lldp parent fact to fall back to routers" [puppet] - 10https://gerrit.wikimedia.org/r/714861 (owner: 10Jbond) [15:38:28] (03CR) 10Elukey: [C: 03+2] kubeflow: raise cpu limits for the kfserving controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/715042 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:38:32] (03PS1) 10Jbond: lldp fact: updated lldp parent fact to fall back to routers [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) [15:39:15] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [15:40:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:36] (03PS6) 10Ssingh: wikidough: add support for the durum check service [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [15:49:58] (03CR) 10Ssingh: "Thanks for the review!" 
[dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [15:50:38] (03CR) 10Ssingh: [C: 03+2] Add durum1001 to BGP anycast in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/715007 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [15:51:14] RECOVERY - Host ripe-atlas-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.47 ms [15:51:17] (03Merged) 10jenkins-bot: Add durum1001 to BGP anycast in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/715007 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [15:52:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:52:58] (03CR) 10BBlack: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [15:53:08] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 412 probes of 571 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:53:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:53:30] RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [15:53:56] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:02] fyi the ganeti parent icinga issues should be fixed now [15:55:38] (03PS1) 10Andrew Bogott: nova_fullstack_test.py: capture output on succesfull puppet check [puppet] - 10https://gerrit.wikimedia.org/r/715050 (https://phabricator.wikimedia.org/T289663) [15:56:14] RECOVERY - Check 
correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:56:28] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10Aklapper) [15:56:29] !log ran homer for Gerrit 715007: Set up BGP peering to durum1001 in eqiad [15:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:50] (03PS2) 10Andrew Bogott: nova_fullstack_test.py: capture output on succesful puppet check [puppet] - 10https://gerrit.wikimedia.org/r/715050 (https://phabricator.wikimedia.org/T289663) [15:57:08] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:34] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) [15:58:10] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 10 probes of 571 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:00:05] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T1600). [16:00:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:01:48] dancy: do you think i could convince you to +2 and backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/715051 ? 
sorry i missed that the first time [16:02:01] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack_test.py: capture output on succesful puppet check [puppet] - 10https://gerrit.wikimedia.org/r/715050 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [16:02:37] MatmaRex: Absolutely [16:02:46] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10Ladsgroup) I'm sorry if it sounds stupid or you already considered it but for the sake of being consistent with most other teams. You can have `Traffic-team` for tracking the ongoing work and `Traffic` stay... [16:03:23] (03PS5) 10JMeybohm: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) [16:05:32] (03CR) 10Ssingh: [C: 03+2] wikidough: add support for the durum check service [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [16:05:59] (03PS7) 10Ssingh: wikidough: add support for the durum check service [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [16:07:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:08:52] PROBLEM - Host ripe-atlas-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:09:10] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:10:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:11:10] PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:201:208:80:152:244) [16:13:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:43] (03PS1) 10Ahmon Dancy: PageStore: Pass query flags to getPageById() too [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714864 (https://phabricator.wikimedia.org/T289717) [16:19:34] (03CR) 10JMeybohm: [C: 03+2] rsyslog/kubernetes: Add explicit name to mmkubernetes action [puppet] - 10https://gerrit.wikimedia.org/r/715024 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [16:20:14] (03PS1) 10Elukey: knative-serving: fix templating for memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715053 (https://phabricator.wikimedia.org/T278194) [16:22:59] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10Dzahn) 05Open→03Resolved The VM has been created and is up and running. [16:23:39] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10ssingh) >>! In T289693#7312422, @Dzahn wrote: > The VM has been created and is up and running. Yes, thanks, sorry, should have updated the ticket! 
[16:24:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:24:27] (03CR) 10Elukey: [C: 03+2] knative-serving: fix templating for memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715053 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [16:24:34] (03CR) 10David Caro: P:standard: move admin to its own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [16:25:26] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: try to have cloud-init perform the first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/715045 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [16:26:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:07] (03PS1) 10JMeybohm: rsyslog/kubernetes: Fix name in rsyslog action [puppet] - 10https://gerrit.wikimedia.org/r/715054 (https://phabricator.wikimedia.org/T289766) [16:27:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[16:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30878/console" [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:28:54] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: authorize durum1001 host for durum [puppet] - 10https://gerrit.wikimedia.org/r/715029 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:29:00] (03CR) 10JMeybohm: [C: 03+2] rsyslog/kubernetes: Fix name in rsyslog action [puppet] - 10https://gerrit.wikimedia.org/r/715054 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [16:30:41] (03PS2) 10Ssingh: acme_chief: authorize durum1001 host for durum [puppet] - 10https://gerrit.wikimedia.org/r/715029 (https://phabricator.wikimedia.org/T289536) [16:31:35] (03CR) 10Ssingh: [C: 03+2] acme_chief: authorize durum1001 host for durum [puppet] - 10https://gerrit.wikimedia.org/r/715029 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:31:39] (03CR) 10Hnowlan: osm: migrate cron osm_sync_lag to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:34:25] (03CR) 10Hnowlan: osm: migrate cron osm_sync_lag to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:34:46] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10BBlack) >>! In T289787#7312331, @Ladsgroup wrote: > I'm sorry if it sounds stupid or you already considered it but for the sake of being consistent with most other teams. You can have `Traffic-team` for tra... 
[16:38:04] (03PS2) 10Urbanecm: PageStore: Pass query flags to getPageById() too [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714864 (https://phabricator.wikimedia.org/T289717) (owner: 10Ahmon Dancy) [16:39:05] (03PS3) 10Ssingh: durum: intial commit [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) [16:42:46] (03PS1) 10Andrew Bogott: Revert "nova vendordata: try to have cloud-init perform the first puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/714865 [16:43:26] (03CR) 10jerkins-bot: [V: 04-1] Revert "nova vendordata: try to have cloud-init perform the first puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/714865 (owner: 10Andrew Bogott) [16:43:38] 10SRE, 10ops-eqiad, 10 Data-Engineering, 10Analytics-Clusters, 10DC-Ops: (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10odimitrijevic) [16:44:36] (03CR) 10Jbond: [V: 03+1] "not sure why pcc doesn't see this fact but see inline noticed something else which we may want to consider?" 
[puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:45:29] (03PS2) 10Andrew Bogott: Revert "nova vendordata: try to have cloud-init perform the first puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/714865 [16:46:15] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova vendordata: try to have cloud-init perform the first puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/714865 (owner: 10Andrew Bogott) [16:46:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:48:18] (03PS6) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:48:51] (03PS4) 10Ssingh: durum: intial commit [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) [16:49:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30880/console" [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:50:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:50:21] (03CR) 10Hnowlan: Helmfile for image suggestion api (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [16:52:00] (03PS7) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:52:16] (03PS5) 10Ssingh: durum: intial commit 
[puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) [16:53:54] (03PS4) 10Zabe: osm: migrate cron osm_sync_lag to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) [16:54:18] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) Hi, Adam, can you formaly confirm the signed contractual relationship with WMF @dr0ptp4kt, sadly contractors don't appear on namely or ldap-crop, so we have to ask the manager. Whe... [16:55:29] (03CR) 10Zabe: osm: migrate cron osm_sync_lag to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:56:02] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) p:05Triage→03High a:05jcrespo→03dr0ptp4kt [16:56:10] (03PS8) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:56:14] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/30882/durum1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:57:52] (03PS9) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [16:59:46] urbanecm: Are handling https://gerrit.wikimedia.org/r/c/mediawiki/core/+/714864 now? [17:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T1700). 
[17:00:05] dancy: no, i only added the hash to commit msg [17:00:12] ok [17:00:16] * urbanecm heads into a meeting, too, so no time for that now [17:00:31] I'll take care of it. [17:00:34] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) User account "SimoneThisDot" is not registered. on wikitech, I am assuming you mean the one on your Phab profile: "Simone Cuomo" :-) [17:01:19] urbanecm: btw, I thought Gerrit would add that extra bit to the commit message by itself. Annoying. [17:01:40] dancy: it does, but the cherry-picked commit has to be merged [17:01:55] (it doesn't have a (permanent git) hash before it is merged, only change-id) [17:02:07] ah, I see. [17:02:12] I cherry-picked too early. [17:02:22] (03CR) 10Ahmon Dancy: [C: 03+2] PageStore: Pass query flags to getPageById() too [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714864 (https://phabricator.wikimedia.org/T289717) (owner: 10Ahmon Dancy) [17:02:40] (03PS10) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:06:52] (03PS11) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:09:37] (03PS12) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:09:53] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10Ladsgroup) Sure. Let me know if I can help with anything! 
[17:10:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:11:15] (03PS13) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:12:31] (03PS14) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:17:26] PROBLEM - Long running screen/tmux on maps2007 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 15500, 1729501s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:18:31] (03PS15) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:18:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:51] (03PS16) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:20:43] (03PS17) 10Jbond: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [17:21:51] 10SRE, 10Release-Engineering-Team: Puppet failure on deployment-kafka-jumbo-3.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T289782 (10Majavah) [17:22:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:23:32] (03Merged) 10jenkins-bot: PageStore: Pass query flags to getPageById() too [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714864 (https://phabricator.wikimedia.org/T289717) (owner: 10Ahmon Dancy) [17:23:50] Deploying ^^^ now. [17:24:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:24:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:26:26] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.20/includes/page/PageStore.php: Backport: [[gerrit:714864|PageStore: Pass query flags to getPageById() too (T289717 T195069)]] (duration: 01m 05s) [17:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:32] T289717: Wikimedia\Assert\PostconditionException: Postcondition failed: Revision had no page - https://phabricator.wikimedia.org/T289717 [17:26:32] T195069: Factor PageStore and PageRecord out of WikiPage - https://phabricator.wikimedia.org/T195069 [17:27:11] And rolling the train to group1 to see how things look [17:27:28] (03PS1) 10Ahmon Dancy: group1 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715059 [17:27:30] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715059 (owner: 10Ahmon Dancy) [17:28:29] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715059 (owner: 10Ahmon Dancy) [17:29:41] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.20 [17:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:00] 
PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:30:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:30:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:47] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.20 (duration: 01m 05s) [17:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:37:11] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10dr0ptp4kt) @jcrespo Yes, confirmed on the contract with This Dot (Simone is a This Dot consultant working on WMF projects). [17:38:19] (03CR) 10Jbond: "just adding some notes" [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [17:38:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:39:47] (03CR) 10Jbond: P:standard: move admin to its own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [17:40:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:58] 10SRE, 10SRE-swift-storage, 10MediaWiki-extensions-Score, 10Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (10Reedy) >So when lilypond is upgraded, files are regenerated, and the old files stay on disk forever (currently), with no easy way t... [17:41:13] (03PS6) 10Ssingh: durum: intial commit [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) [17:43:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:46:52] dancy: it's looking better now, right? i see no new errors since you deployed [17:47:04] Looks ok so far. [17:47:08] yeah, things seem pretty quiet [17:47:16] I'm going to let it marinate for an hour or so, then roll forward to group2. 
[17:47:28] sounds good [17:47:34] great, thanks [17:48:40] ac [17:51:29] (03PS6) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:54:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715043 (owner: 10Volans) [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T1800) [18:00:05] mforns and nn1l2: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] o/ [18:00:27] I can deploy today [18:00:28] hey :] [18:00:31] Hello [18:00:44] (unless mforns would prefer to self-service :)) [18:00:55] no no, please go ahear [18:00:57] *ahead [18:01:00] (03PS2) 10Urbanecm: Finalize Event Platform migration of EchoEmail and EchoInteraction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714794 (https://phabricator.wikimedia.org/T287210) (owner: 10Mforns) [18:01:03] (03CR) 10Urbanecm: [C: 03+2] Finalize Event Platform migration of EchoEmail and EchoInteraction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714794 (https://phabricator.wikimedia.org/T287210) (owner: 10Mforns) [18:01:27] can we go ahead with 714872? 
[18:01:35] (03PS7) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [18:01:44] urbanecm: a note: as this is a change that just removes unused config, I'm not sure how to test it [18:01:52] nn1l2: sure, I'm processing the deployments in the order they're in the calendar 🙂 [18:01:56] (03Merged) 10jenkins-bot: Finalize Event Platform migration of EchoEmail and EchoInteraction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714794 (https://phabricator.wikimedia.org/T287210) (owner: 10Mforns) [18:02:00] i'll ping you once it will be your time! [18:02:10] mforns: thanks for noting that. If it's unused, i'll just sync it out [18:02:18] ok, thanks! [18:03:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:24] nn1l2: do you have a test page prepared with quiz syntax to test your patch please? [18:03:38] (just a simple quiz copied from another wiki should be enough) [18:03:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d4340e9c18468d14885c8ced87f1e014a3481f2a: Finalize Event Platform migration of EchoEmail and EchoInteraction (T287210) (duration: 01m 07s) [18:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:55] T287210: EchoMail and EchoInteraction Event Platform Migration - https://phabricator.wikimedia.org/T287210 [18:04:09] yes [18:04:10] mforns: should be live! [18:04:16] urbanecm: thanks a lot! 
[18:04:21] https://fa.wikibooks.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:4nn1l2/%D8%AA%D9%85%D8%B1%DB%8C%D9%86 [18:04:34] (03PS2) 10Urbanecm: Install Extension Quiz on fa.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714872 (https://phabricator.wikimedia.org/T289381) (owner: 104nn1l2) [18:04:38] (03CR) 10Jbond: [C: 03+1] "the code looks fine to me but not tested" [software] - 10https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: 10Volans) [18:04:47] (03CR) 10Urbanecm: [C: 03+2] Install Extension Quiz on fa.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714872 (https://phabricator.wikimedia.org/T289381) (owner: 104nn1l2) [18:06:14] (03Merged) 10jenkins-bot: Install Extension Quiz on fa.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714872 (https://phabricator.wikimedia.org/T289381) (owner: 104nn1l2) [18:06:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:45] nn1l2: thanks, that should work. Do you know how to use https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage already? 🙂 [18:06:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:17] I have Debug installed, but forgot how to work with it [18:07:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:52] nn1l2: great. So, I've pulled your fa.wikibooks patch to mwdebug2001. 
Please open the extension, select that host in it, and click the off/on switch [18:07:59] then, go to the test page you prepared, and try to preview it [18:08:03] the quiz should magically work :) [18:08:12] (03PS8) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [18:09:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:24] which server should I select [18:10:51] nn1l2: mwdebug2001 [18:11:08] (mwdebug2001.codfw.wmnet, if you want the full form) [18:11:56] Do you see any changes? [18:12:10] I still don't see any chnages in my sandbox page [18:12:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:12:54] nn1l2: you need to open the sandbox page for editing and preview it [18:13:25] yes [18:13:31] It's okay now [18:14:03] great [18:14:04] syncing [18:14:38] (03PS2) 10Urbanecm: Install Extension Quiz on ja.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714873 (https://phabricator.wikimedia.org/T289383) (owner: 104nn1l2) [18:14:44] (03CR) 10Urbanecm: [C: 03+2] Install Extension Quiz on ja.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714873 (https://phabricator.wikimedia.org/T289383) (owner: 104nn1l2) [18:15:14] Sandbox for the Japanese Wikibooks [18:15:14] https://ja.wikibooks.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:4nn1l2/Sandbox [18:15:32] (03Merged) 10jenkins-bot: Install Extension Quiz on ja.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714873 (https://phabricator.wikimedia.org/T289383) (owner: 104nn1l2) [18:15:35] !log urbanecm@deploy1002 Synchronized 
wmf-config/InitialiseSettings.php: cde88918b73628f2eaaff919ddb869b4dc2c93c6: Install Extension Quiz on fa.wikibooks (T289381) (duration: 01m 07s) [18:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:40] should be live [18:15:40] T289381: Install the Quiz extension on the Persian Wikibooks - https://phabricator.wikimedia.org/T289381 [18:15:47] (fawikibooks i mean) [18:15:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:03] nn1l2: the jawikibooks patch is at mwdebug2001 now [18:16:05] can you test, too? [18:16:30] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on durum1001.eqiad.wmnet with reason: testing out durum [18:16:31] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on durum1001.eqiad.wmnet with reason: testing out durum [18:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:43] That works too :) [18:16:47] (03CR) 10Ssingh: "Going to merge this to test it out, will revert and then ask for code review." [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [18:16:52] (03CR) 10Ssingh: [C: 03+2] durum: intial commit [puppet] - 10https://gerrit.wikimedia.org/r/715038 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [18:17:01] great, syncing [18:17:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 66717bc039f40336144dcc0dfd97ff5331b418e9: Install Extension Quiz on ja.wikibooks (T289383) (duration: 01m 05s) [18:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:30] T289383: Install the Quiz extension on the Japanese Wikibooks - https://phabricator.wikimedia.org/T289383 [18:19:31] nn1l2: and live, too [18:19:34] anything else i can help with? [18:20:48] (03PS1) 10Ssingh: durum: fix notify for uWSGI service [puppet] - 10https://gerrit.wikimedia.org/r/715065 [18:21:56] Sorry, but I still don't see any changes in my Farsi sandbox in the view mode (not the edit mode), even after making a dummy edit [18:22:16] let me check [18:22:21] (03CR) 10Ssingh: [C: 03+2] durum: fix notify for uWSGI service [puppet] - 10https://gerrit.wikimedia.org/r/715065 (owner: 10Ssingh) [18:23:16] it works in preview, so it has to be some sort of cache [18:23:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:44] and copying your sandbox to my sandbox also works: https://fa.wikibooks.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Martin_Urbanec/sand [18:24:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:45] nn1l2: now it works [18:25:03] Yes, confirmed [18:25:06] great [18:26:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:44] (03PS1) 10Ssingh: durum: fix requirements for durum.py [puppet] - 10https://gerrit.wikimedia.org/r/715066 [18:27:29] The Japanese works too :) [18:28:03] great! [18:28:22] (03CR) 10Ssingh: [C: 03+2] durum: fix requirements for durum.py [puppet] - 10https://gerrit.wikimedia.org/r/715066 (owner: 10Ssingh) [18:28:43] I think I'm dome here. Can I leave? [18:28:48] *done [18:28:51] sure [18:28:58] talk to you later :) [18:30:35] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 80, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:30:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:33:31] (03PS9) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [18:34:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:35:09] (03PS10) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [18:37:36] (03PS1) 10QChris: Add .gitreview [debs/python-eventlet] - 10https://gerrit.wikimedia.org/r/715067 [18:37:38] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/python-eventlet] - 
10https://gerrit.wikimedia.org/r/715067 (owner: 10QChris) [18:46:35] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:48:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:50:07] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:50:36] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (10Cmjohnson) [18:50:48] 10SRE, 10Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (10JAllemandou) [18:51:09] 10SRE, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson All the connections have been documented and labels updated [18:51:53] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:54:26] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:49] (03PS1) 10Ssingh: durum: update nginx.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/715069 [18:57:17] (03CR) 10jerkins-bot: [V: 04-1] durum: update nginx.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/715069 (owner: 10Ssingh) [18:58:25] (03PS2) 10Ssingh: durum: update nginx.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/715069 [18:59:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:41] Rolling the train forward to group2 [18:59:56] (03PS1) 10Ahmon Dancy: group2 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715070 [18:59:58] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715070 (owner: 10Ahmon Dancy) [19:00:00] (03CR) 10Ssingh: [C: 03+2] durum: update nginx.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/715069 (owner: 10Ssingh) [19:00:04] dancy and brennen: May I have your attention please! MediaWiki train - American Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T1900) [19:00:17] 'merican [19:00:37] as mom and apple pie [19:00:40] or something like that [19:00:49] (Merged) jenkins-bot: group2 wikis to 1.37.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/715070 (owner: Ahmon Dancy) [19:02:20] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.37.0-wmf.20 [19:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:46] SRE, ops-eqiad, DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Cmjohnson) [19:08:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:23] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:09:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:15:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:26:49] (03PS1) 10Ssingh: hiera: update hieradata for durum.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715074 [19:27:11] (03PS3) 10Herron: Add Varnish SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [19:27:23] (03CR) 10Ssingh: [C: 03+2] hiera: update hieradata for durum.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715074 (owner: 10Ssingh) [19:31:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:33:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:34:29] (03CR) 10Herron: "I added latency placeholder values, and rebased as well which required adding a request_sli_query, HTH!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [19:40:37] (03CR) 10Herron: [C: 03+1] o11y: add prometheus alerts [alerts] - 10https://gerrit.wikimedia.org/r/715032 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [19:41:06] (03PS1) 10Ssingh: durum: add CORS header to nginx.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/715078 [19:43:53] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/30886/durum1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/715078 (owner: 10Ssingh) [19:43:55] (03CR) 10Ssingh: [C: 03+2] durum: add CORS header to nginx.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/715078 (owner: 10Ssingh) [19:46:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10RobH) [19:46:39] 10SRE, 10ops-eqiad, 10 Data-Engineering, 10Analytics-Clusters, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [19:46:49] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) [19:46:54] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10RobH) [19:47:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q1:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10RobH) [19:47:07] 10SRE, 10ops-eqiad: Q1:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10RobH) [19:51:05] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:52:38] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:52:59] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:53:57] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:53:57] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:01] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:58:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:58:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:02:23] (03CR) 10Herron: "I'm torn on naming between 'aka' and 'aliases' but LGTM overall. 
I'll defer to service ops for +1s" [puppet] - 10https://gerrit.wikimedia.org/r/714965 (owner: 10Filippo Giunchedi) [20:03:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:10:30] anybody have thoughts on T289792? [20:10:31] T289792: Usage and linking of a page or file through a redirect is not reported by API query for linkshere and fileusage - https://phabricator.wikimedia.org/T289792 [20:11:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:13:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:18:05] (03PS1) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) [20:36:12] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T289812 (10RobH) [20:36:32] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T289812 (10RobH) [20:36:54] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10RobH) [20:37:42] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10RobH) [20:38:17] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10RobH) a:03Jclark-ctr [20:38:23] 
ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (RobH) [20:40:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:45:29] brennen: it would be good to have confirmation that's an issue on real Commons and not just Beta... [20:45:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:47:37] legoktm: yeah. per James_F feedback on-task, have removed as a blocker for the time being. [20:47:54] I disagree with James on the severity, but I just tried it on Commons and it works fine [20:50:03] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:55:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:06:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:08:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:15:27] SRE, LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (jcrespo) a:dr0ptp4kt→jcrespo [21:18:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:33:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:38:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:40:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:58:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:01:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:06:07] !log restarted mailman3-web on lists1001 (T289798) [22:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:57] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:22:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:24:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:28:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10wiki_willy) [22:29:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10wiki_willy) [22:30:20] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10wiki_willy) [22:40:20] (03PS11) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [22:40:22] (03PS1) 10Dduvall: aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) [22:55:33] PROBLEM 
- Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1864 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:00:05] brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210826T2300). [23:00:25] * thcipriani waves [23:00:52] * Platonides waves back to thcipriani [23:05:01] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1429 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:07:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:07:39] * xSavitar waves [23:07:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [23:08:06] * legoktm looks [23:08:27] mr1-esams.wikimedia.org [23:08:39] XioNoX: another false positive?
[23:08:51] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1053 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:08:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:12:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [23:13:45] I think it was, but I'll file a task just in case [23:14:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:14:41] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:16:52] T289820 [23:16:53] T289820: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 [23:18:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:24:05] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1429 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:25:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:27:35] RECOVERY - Router interfaces on cr2-esams is OK:
OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:27:51] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2273 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:28:24] oops sorry I was afk, thanks legoktm [23:28:48] np :) [23:33:37] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1216 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:39:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:40:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:41:13] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1111 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:45:01] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1233 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:49:07] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:50:43] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1724 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:51:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status