[00:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:57] PROBLEM - Check systemd state on ms-be2041 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:11] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:53:07] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:08:17] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2041 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [02:14:17] RECOVERY - Check systemd state on ms-be2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:39:21] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2041 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [02:57:28] 10SRE, 10Analytics-Radar, 10Traffic, 10WMF-General-or-Unknown, 10Performance-Team (Radar): Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10AntiCompositeNumber) It's not just `/static`, JavaScript and CSS... 
[03:50:13] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:52:19] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:47:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:49:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:06:32] (03CR) 10Marostegui: "Thanks a lot Jaime" [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [05:10:47] (03PS1) 10Marostegui: Revert "clouddb1020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/731232 [05:12:03] (03CR) 10Marostegui: [C: 03+2] Revert "clouddb1020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/731232 (owner: 10Marostegui) [05:12:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Marostegui) Enabled notifications for this host. 
[05:18:44] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1126 - https://phabricator.wikimedia.org/T292325 (10Marostegui) [05:21:59] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:25:57] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:50:09] 10SRE, 10Wikimedia-Mailing-lists: "Oversight-wp-pt" management - https://phabricator.wikimedia.org/T293592 (10Ladsgroup) The owner doesn't seem to be an oversight anymore https://pt.wikipedia.org/wiki/Especial:Lista_de_utilizadores/oversight I can't disclose its members or the owner. Some look correct, some d... [05:54:05] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:54:48] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: "Oversight-wp-pt" management - https://phabricator.wikimedia.org/T293592 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I made the email address attached to your wiki account as an owner of that mailing list. You can make an account in lists.wikimedia.... [06:00:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: actually load the kafka module in rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/731123 (owner: 10Giuseppe Lavagetto) [06:04:55] (03Merged) 10jenkins-bot: mediawiki: actually load the kafka module in rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/731123 (owner: 10Giuseppe Lavagetto) [06:09:51] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[06:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:39] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [06:16:17] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:17:43] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [06:18:21] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:18:33] 10SRE, 10DBA, 10Data-Persistence (Consultation), 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Marostegui) At the current query rate we cannot really enable the slow query log on our hosts so this would need to be done in a differe... [06:18:42] 10SRE, 10Data-Persistence (Consultation), 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Marostegui) [06:31:02] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[06:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:59] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:58:58] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: Beta cluster certificates have expired (September 2020) - https://phabricator.wikimedia.org/T262806 (10Aklapper) [07:01:51] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:00] (03CR) 10Muehlenhoff: [C: 03+2] Switch to puppet-generated contacts file [puppet] - 10https://gerrit.wikimedia.org/r/728318 (owner: 10Muehlenhoff) [07:10:11] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:12:15] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:17:17] (03PS1) 10Muehlenhoff: Remove now obsolete owners.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/731321 [07:25:45] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete owners.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/731321 (owner: 10Muehlenhoff) [07:29:50] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 7 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) 05In progress→03Resolved Thank you... 
[07:30:29] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 7 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) [07:33:41] 10SRE, 10Traffic, 10Performance-Team (Radar), 10User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (10ema) >>! In T292290#7418496, @Krinkle wrote: > I've made some improvements to the by-host dash that may be of use: > !log depool + restart blazegraph on wdqs1013 [07:34:03] !log cp3060 (text), cp3061 (upload): upgrade varnish to 6.0.8 T292290 [07:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:13] T292290: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 [07:43:23] 10SRE, 10DBA, 10Sustainability (Incident Followup): Lower automatic query killing threshold to 55 seconds - https://phabricator.wikimedia.org/T293533 (10Marostegui) I have nothing against this but it is a quite massive task as we need to drop and recreate the query killer across all hosts (without replicatio... 
[07:44:09] 10SRE, 10DBA, 10Sustainability (Incident Followup): Lower automatic query killing threshold to 55 seconds - https://phabricator.wikimedia.org/T293533 (10Marostegui) p:05Triage→03Medium [07:45:48] (03CR) 10Muehlenhoff: [C: 03+2] Add more role contacts [puppet] - 10https://gerrit.wikimedia.org/r/731095 (owner: 10Muehlenhoff) [07:58:18] (03CR) 10MMandere: [C: 03+2] prometheus: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730793 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [08:00:29] (03Abandoned) 10David Caro: remote: use only the last line for the uptime [software/spicerack] - 10https://gerrit.wikimedia.org/r/730270 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [08:03:35] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:05:39] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:10:52] (03PS2) 10Vgutierrez: systemd: Allow paging on a systemd::service failure [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) [08:11:07] (03CR) 10Vgutierrez: systemd: Allow paging on a systemd::service failure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [08:16:16] (03CR) 10Ema: [C: 03+1] systemd: Allow paging on a systemd::service failure [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [08:17:13] (03CR) 10Vgutierrez: [C: 03+2] systemd: Allow paging on a systemd::service failure [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [08:20:56] (03PS1) 10Elukey: profile::prometheus::alerts: update kafka mirror maker settings [puppet] - 10https://gerrit.wikimedia.org/r/731332 [08:26:02] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: expire metric files not updated for 3y [puppet] - 10https://gerrit.wikimedia.org/r/730427 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [08:26:07] (03PS2) 10Filippo Giunchedi: graphite: expire metric files not updated for 3y [puppet] - 10https://gerrit.wikimedia.org/r/730427 (https://phabricator.wikimedia.org/T247963) [08:26:27] (03PS2) 10Elukey: profile::prometheus::alerts: update kafka mirror maker settings [puppet] - 10https://gerrit.wikimedia.org/r/731332 [08:27:28] (03CR) 10Elukey: "Andrew if you have moment lemme know if this makes sense, or if I need to drink more coffee :D" [puppet] - 10https://gerrit.wikimedia.org/r/731332 (owner: 10Elukey) [08:44:28] (03CR) 10Michael Große: [C: 03+1] Unconditionally enable Wikibase dispatching via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731014 
(https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [08:44:35] (03CR) 10Michael Große: [C: 03+1] Remove wmg variables for dispatch via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731015 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [08:49:36] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Marostegui) I wouldn't like to have an external process running all the time "just in case" and from what I have seen during outages, pt-kill struggles too when th... [08:57:03] !log cleanup graphite metrics not modified for >= ~3yr (1024 days) [08:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:53] (03PS1) 10Vgutierrez: acme_chief: Enable watchdog on production servers [puppet] - 10https://gerrit.wikimedia.org/r/731335 (https://phabricator.wikimedia.org/T292619) [09:06:13] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31730/console" [puppet] - 10https://gerrit.wikimedia.org/r/731335 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:13:14] !log installing apr security updates on bullseye [09:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:17] (03PS1) 10Muehlenhoff: Add library hint for apr [puppet] - 10https://gerrit.wikimedia.org/r/731338 [09:20:39] (03PS1) 10Lucas Werkmeister (WMDE): Don't filter by change Id when dispatching to client wikis [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731237 [09:21:31] (03PS1) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:22:33] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31731/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:23:26] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for apr [puppet] - 10https://gerrit.wikimedia.org/r/731338 (owner: 10Muehlenhoff) [09:25:04] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) [09:25:20] (03PS2) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:25:30] (03PS1) 10Lucas Werkmeister (WMDE): Make deduplication actually work for DispatchChangesJob [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731239 (https://phabricator.wikimedia.org/T291118) [09:26:08] (03PS2) 10Lucas Werkmeister (WMDE): Make deduplication actually work for DispatchChangesJob [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731239 (https://phabricator.wikimedia.org/T291118) [09:26:10] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:26:43] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31732/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:26:56] 10SRE, 10Traffic, 10User-ema: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 - https://phabricator.wikimedia.org/T293605 (10ema) [09:28:42] 10SRE, 10Traffic, 10User-ema: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 - https://phabricator.wikimedia.org/T293605 (10ema) p:05Triage→03Low Setting priority to low for now as these seem isolated, sporadic crashes and systemd took care of the restarts as expected so 
there was no production impact. [09:29:27] (03PS3) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:30:37] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31733/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:32:17] (03CR) 10Michael Große: [C: 03+1] Don't filter by change Id when dispatching to client wikis [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731237 (owner: 10Lucas Werkmeister (WMDE)) [09:32:23] (03CR) 10Michael Große: [C: 03+1] Create DispatchChangesJob without change id [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731238 (owner: 10Lucas Werkmeister (WMDE)) [09:32:29] (03CR) 10Michael Große: [C: 03+1] Make deduplication actually work for DispatchChangesJob [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731239 (https://phabricator.wikimedia.org/T291118) (owner: 10Lucas Werkmeister (WMDE)) [09:32:38] (03PS4) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:33:34] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31734/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:37:49] (03PS5) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:38:04] !log sync metrics from graphite1004 to graphite2003 - T247963 [09:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:10] T247963: Migrate role::graphite::production to Bullseye - https://phabricator.wikimedia.org/T247963 
[09:39:01] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31735/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:39:09] !log updating acme-chief to version 0.34 on acmechief instances - T292619 [09:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:15] T292619: Implement a watchdog mechanism on acme-chief - https://phabricator.wikimedia.org/T292619 [09:39:31] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] acme_chief: Enable watchdog on production servers [puppet] - 10https://gerrit.wikimedia.org/r/731335 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:44:21] (03PS6) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:46:43] (03PS7) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:48:06] (03PS1) 10Vgutierrez: acme_chief: Enable monitoring on systemd [puppet] - 10https://gerrit.wikimedia.org/r/731343 (https://phabricator.wikimedia.org/T292619) [09:48:37] (03PS8) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:48:58] !log installing node-tar security updates on buster [09:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:52] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31738/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:50:46] (03PS9) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [09:52:44] (03PS2) 
10Vgutierrez: acme_chief: Enable monitoring on systemd [puppet] - 10https://gerrit.wikimedia.org/r/731343 (https://phabricator.wikimedia.org/T292619) [09:52:51] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31740/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [09:53:14] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Enable monitoring on systemd [puppet] - 10https://gerrit.wikimedia.org/r/731343 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:53:41] (03PS10) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [09:54:02] (03PS5) 10Jbond: standard: remove standard module [puppet] - 10https://gerrit.wikimedia.org/r/730856 [09:54:22] (03PS3) 10Vgutierrez: acme_chief: Enable monitoring on systemd [puppet] - 10https://gerrit.wikimedia.org/r/731343 (https://phabricator.wikimedia.org/T292619) [09:57:12] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10MatthewVernon) 05Open→03Resolved Full weight restored, so closing this (again ;-) ) [09:57:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31741/console" [puppet] - 10https://gerrit.wikimedia.org/r/731343 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:57:33] jouncebot: nowandnext [09:57:33] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [09:57:33] In 1 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1100) [09:57:52] if it’s okay with everyone I’d like to start early with some of the backports I scheduled for that window [09:57:59] (I’ll start in 5 minutes or so if nobody shouts at me) [09:59:36] (03CR) 10Jbond: [C: 03+2] 
standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [09:59:42] (03CR) 10Jbond: [C: 03+2] standard: remove standard module [puppet] - 10https://gerrit.wikimedia.org/r/730856 (owner: 10Jbond) [10:02:33] (03PS1) 10Phuedx: [beta] Rename $wgIPInfoGeoIP2Path to $wgIPInfoGeoIP2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731346 (https://phabricator.wikimedia.org/T289361) [10:06:05] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_translationnotifications-mediawikiwiki.service,mediawiki_job_translationnotifications-metawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:59] (03PS46) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:08:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Don't filter by change Id when dispatching to client wikis [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731237 (owner: 10Lucas Werkmeister (WMDE)) [10:08:24] I’m starting my backports now [10:08:29] (03PS47) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:09:07] (03PS1) 10David Caro: mariadb::packages: use ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/731348 (https://phabricator.wikimedia.org/T293604) [10:11:39] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31742/console" [puppet] - 10https://gerrit.wikimedia.org/r/731348 (https://phabricator.wikimedia.org/T293604) (owner: 10David Caro) [10:13:23] 10SRE-swift-storage, 10ops-codfw: moss-be2002.mgmt alert for DNS - https://phabricator.wikimedia.org/T293610 (10fgiunchedi) [10:13:45] (03CR) 10David 
Caro: [V: 03+1] "The PCC looks good, works on the changes 😊" [puppet] - 10https://gerrit.wikimedia.org/r/731348 (https://phabricator.wikimedia.org/T293604) (owner: 10David Caro) [10:19:19] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10jbond) >>! In T291905#7431136, @elukey wrote: > To recap the next steps: > * Add the cfssl CA cert to the base truststore of all jvms... [10:20:12] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10jbond) >>! In T291905#7431157, @Joe wrote: > For the record, we've created a `wmf-certificates` debian package that includes the pupp... [10:25:23] (03PS10) 10Kormat: mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) [10:25:48] (03PS1) 10Jbond: P:pki: deploy certifcates via wmf-certificates [puppet] - 10https://gerrit.wikimedia.org/r/731350 (https://phabricator.wikimedia.org/T291905) [10:28:11] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31743/console" [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [10:30:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: open egress to kafka-logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/731353 [10:30:29] (03PS1) 10Giuseppe Lavagetto: mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 [10:31:37] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 (owner: 10Giuseppe Lavagetto) [10:32:25] (03CR) 10Jbond: [C: 03+2] P:pki: deploy certifcates via wmf-certificates [puppet] - 
10https://gerrit.wikimedia.org/r/731350 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [10:33:56] (03PS2) 10Giuseppe Lavagetto: mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 [10:34:24] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 (owner: 10Giuseppe Lavagetto) [10:34:40] (03Merged) 10jenkins-bot: Don't filter by change Id when dispatching to client wikis [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731237 (owner: 10Lucas Werkmeister (WMDE)) [10:35:09] alright [10:35:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: open egress to kafka-logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/731353 (owner: 10Giuseppe Lavagetto) [10:36:17] I don’t think there’s a way to test that first backport, I’ll just sync it [10:38:06] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/: Backport: [[gerrit:731237|Don't filter by change Id when dispatching to client wikis ()]] (duration: 00m 59s) [10:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Create DispatchChangesJob without change id [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731238 (owner: 10Lucas Werkmeister (WMDE)) [10:39:17] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02323 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:39:35] (03Merged) 10jenkins-bot: mediawiki: open egress to kafka-logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/731353 (owner: 10Giuseppe Lavagetto) [10:39:57] ^ "E: Unable to locate package wmf-certificates" [10:40:03] re: puppet failures [10:40:52] (03CR) 10Marostegui: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [10:41:02] (03CR) 10Marostegui: [C: 03+1] mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [10:41:35] moritzm, perhaps? [10:42:03] nope, jbond! [10:42:56] jbond: wmf-certificates package is not in apt for stretch, afaict [10:43:42] yeah, the affected hosts seem all to be stretch hosts [10:44:28] the rebel alliance [10:47:53] !log copied wmf-certificates from buster-wikimedia to stretch-wikimedia in reprepro [10:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:04] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (10Marostegui) p:05Triage→03Medium Probably grouping it by day is enough? Something like: ` root@db1169.eqiad.wmnet[o... [10:50:03] kormat, jbond: ^ recoveries should be incoming, doublechecked with a manual Puppet run on db2100 [10:51:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:22] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] acme_chief: Enable monitoring on systemd [puppet] - 10https://gerrit.wikimedia.org/r/731343 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [10:55:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:15] 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10LSobanski) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1100). [11:00:05] Lucas_WMDE: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:15] I’m already deploying [11:01:01] (03CR) 10Jbond: [C: 03+1] mariadb::packages: use ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/731348 (https://phabricator.wikimedia.org/T293604) (owner: 10David Caro) [11:01:18] Lucas_WMDE: happy to help e.g. with the mechanical backporting, if you want to focus on the jobrunner status etc. [11:02:54] thanks moritzm [11:03:21] awight: should be fine, but thanks [11:03:35] (03Merged) 10jenkins-bot: Create DispatchChangesJob without change id [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731238 (owner: 10Lucas Werkmeister (WMDE)) [11:05:32] I’ll quickly test this one on mwdebug1001 [11:07:12] seems fine, let’s sync, one file at a time [11:07:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/includes/ChangeModification/DispatchChangesJob.php: Backport: [[gerrit:731238|Create DispatchChangesJob without change id (T291118)]] (duration: 00m 56s) [11:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:23] T291118: Deduplicate based on entity id - https://phabricator.wikimedia.org/T291118 [11:09:44] (03PS1) 10Majavah: P::toolforge: fix ensure => absent for /srv/composer [puppet] - 10https://gerrit.wikimedia.org/r/731359 [11:09:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:15] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:10:33] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/includes/Hooks/RecentChangeSaveHookHandler.php: Backport: [[gerrit:731238|Create DispatchChangesJob without change id (T291118)]] (2/2) (duration: 00m 56s) [11:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Make deduplication actually work for DispatchChangesJob [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731239 (https://phabricator.wikimedia.org/T291118) (owner: 10Lucas Werkmeister (WMDE)) [11:11:24] (03CR) 10Arturo Borrero Gonzalez: "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/731359 (owner: 10Majavah) [11:12:33] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:15:05] (03PS2) 10Majavah: P::toolforge: fix ensure => absent for /srv/composer [puppet] - 10https://gerrit.wikimedia.org/r/731359 [11:16:17] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: cinder-backups: fix ceph keyring file name [puppet] - 10https://gerrit.wikimedia.org/r/731370 (https://phabricator.wikimedia.org/T292546) [11:17:36] (03PS3) 10Giuseppe Lavagetto: mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 [11:17:38] (03PS1) 10Giuseppe Lavagetto: Rakefile: handle yaml errors where no fixtures are present. [deployment-charts] - 10https://gerrit.wikimedia.org/r/731372 [11:17:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/31744/" [puppet] - 10https://gerrit.wikimedia.org/r/731370 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:18:25] <_joe_> moritzm, jbond sorry wmf-certificates was only in buster, bullseye because that's where we needed it for containers :/ [11:19:13] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005666 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:19:56] _joe_: no problem i should have checked :) [11:21:42] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:22:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/731171 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [11:23:34] I filed a suggested production improvement at https://phabricator.wikimedia.org/T293614, not sure which tags to add to it [11:23:39] feel free to take a look :) [11:26:56] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.66 ms [11:27:43] 10Puppet, 10Infrastructure-Foundations, 10Sustainability: Enable 
bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (10Lucas_Werkmeister_WMDE) Tentatively tagging Puppet since I assume that’s how such a change would be deployed. I’m not sure whic... [11:28:06] ah, Infrastructure-Foundations sounds like a reasonable tag, thanks Herald ^^ [11:31:25] (03PS1) 10Jbond: P:java: add Wikimedia_Internal_Root_CA to truststore [puppet] - 10https://gerrit.wikimedia.org/r/731374 (https://phabricator.wikimedia.org/T291905) [11:32:23] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: cinder-backups: fix permissions of ceph keyring file [puppet] - 10https://gerrit.wikimedia.org/r/731375 (https://phabricator.wikimedia.org/T292546) [11:33:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: openstack: cinder-backups: fix permissions of ceph keyring file [puppet] - 10https://gerrit.wikimedia.org/r/731375 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:34:20] (03Merged) 10jenkins-bot: Make deduplication actually work for DispatchChangesJob [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731239 (https://phabricator.wikimedia.org/T291118) (owner: 10Lucas Werkmeister (WMDE)) [11:34:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P::toolforge: fix ensure => absent for /srv/composer [puppet] - 10https://gerrit.wikimedia.org/r/731359 (owner: 10Majavah) [11:35:25] and the final backport can’t be tested again, so I’m just syncing that [11:35:29] should be fine though [11:37:00] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/includes/ChangeModification/DispatchChangesJob.php: Backport: [[gerrit:731239|Make deduplication actually work for DispatchChangesJob (T291118)]] (duration: 00m 55s) [11:37:03] (03CR) 10Jbond: [C: 03+2] P:java: add Wikimedia_Internal_Root_CA to truststore [puppet] - 10https://gerrit.wikimedia.org/r/731374 (https://phabricator.wikimedia.org/T291905) 
(owner: 10Jbond) [11:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:07] T291118: Deduplicate based on entity id - https://phabricator.wikimedia.org/T291118 [11:37:54] alright, that’s all the scheduled changes [11:38:05] but I might go ahead with some config changes (chain starting at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730747) [11:38:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set dispatchViaJobsAllowedClients to null everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730747 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [11:40:38] (03PS3) 10Marostegui: dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001 [puppet] - 10https://gerrit.wikimedia.org/r/721288 (https://phabricator.wikimedia.org/T290868) (owner: 10Jcrespo) [11:40:39] ^ deploying some of those config changes [11:41:05] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001 [puppet] - 10https://gerrit.wikimedia.org/r/721288 (https://phabricator.wikimedia.org/T290868) (owner: 10Jcrespo) [11:41:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:28] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:45:10] (03Merged) 10jenkins-bot: Set dispatchViaJobsAllowedClients to null everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730747 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [11:45:41] alright [11:45:41] (03PS1) 10Ssingh: anycast_monitoring: add check for durum [puppet] - 10https://gerrit.wikimedia.org/r/731399 [11:46:09] briefly testing that on mwdebug1001 [11:46:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:46:41] (03PS1) 10Marostegui: db2079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/731401 (https://phabricator.wikimedia.org/T290868) [11:47:40] syncing [11:47:52] (03CR) 10Marostegui: [C: 03+2] db2079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/731401 (https://phabricator.wikimedia.org/T290868) (owner: 10Marostegui) [11:48:33] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:730747|Set dispatchViaJobsAllowedClients to null everywhere (T291828)]] (duration: 00m 56s) [11:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:41] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [11:49:11] !log Reimage db2079 (codfw s8 master) T290868 [11:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:17] 
T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868 [11:49:35] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/31745/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/731399 (owner: 10Ssingh) [11:49:37] (03PS1) 10Btullis: Add initial personal dotfiles and one script [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) [11:50:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseDispatchViaJobsAllowedClients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730748 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [11:51:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2079.codfw.wmnet with OS buster [11:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:19] (03Merged) 10jenkins-bot: Remove $wmgWikibaseDispatchViaJobsAllowedClients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730748 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [11:52:28] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:53:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:38] PROBLEM - MariaDB Replica IO: s8 on db2098 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:54:10] ^ me [11:54:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:730748|Remove $wmgWikibaseDispatchViaJobsAllowedClients (T291828)]] (1/2) (duration: 00m 56s) [11:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:21] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [11:54:59] (03PS1) 10Marostegui: dbbackups: Migrate s8 backups db2100 -> db2098 [puppet] - 10https://gerrit.wikimedia.org/r/731404 (https://phabricator.wikimedia.org/T290868) [11:55:24] (03Abandoned) 10Marostegui: dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001 [puppet] - 10https://gerrit.wikimedia.org/r/721288 (https://phabricator.wikimedia.org/T290868) (owner: 10Jcrespo) [11:55:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:730748|Remove $wmgWikibaseDispatchViaJobsAllowedClients (T291828)]] (2/2) (duration: 00m 56s) [11:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:48] I’ll leave it at that for now and do the other half of my config changes later [11:55:57] !log UTC morning backport window done [11:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:02] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:06:29] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10jbond) >>! In T291905#7435523, @jbond wrote: >>>! In T291905#7431136, @elukey wrote: >> To recap the next steps... [12:09:22] (03CR) 10Jbond: [C: 03+1] Remove alluxio resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:10:14] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:10:48] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [12:10:56] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:11:06] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Add per-section alias [puppet] - 10https://gerrit.wikimedia.org/r/731339 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [12:11:12] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:11:28] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:11:36] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:11:46] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:12:00] RECOVERY - cassandra-b CQL 10.64.32.145:9042 on aqs1012 is OK: TCP OK - 0.000 second response time on 10.64.32.145 port 9042 https://phabricator.wikimedia.org/T93886 [12:13:12] RECOVERY - Disk space on aqs1013 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1013&var-datasource=eqiad+prometheus/ops [12:13:44] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:15:24] RECOVERY - Disk space on aqs1012 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1012&var-datasource=eqiad+prometheus/ops [12:16:13] (03CR) 10Jbond: [C: 03+1] "lgtm minor optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [12:22:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2079.codfw.wmnet with OS buster [12:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:11] RECOVERY - MariaDB Replica IO: s8 on db2098 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:32:13] (03CR) 10David Caro: [V: 03+1 C: 03+2] mariadb::packages: use ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/731348 (https://phabricator.wikimedia.org/T293604) (owner: 10David Caro) [12:32:25] (03CR) 10Kormat: [C: 03+1] dbbackups: Migrate s8 backups db2100 -> db2098 [puppet] - 
10https://gerrit.wikimedia.org/r/731404 (https://phabricator.wikimedia.org/T290868) (owner: 10Marostegui) [12:33:31] (03PS1) 10Jbond: puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 [12:33:55] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:34:13] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 (owner: 10Jbond) [12:35:33] 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10WMDE-leszek) [12:37:13] (03CR) 10David Caro: [C: 04-1] wmcs-srpeadcheck-tools: add new shorter webgrid names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [12:37:38] 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10WMDE-leszek) [12:37:52] (03PS2) 10David Caro: wmcs-srpeadcheck-tools: add new shorter webgrid names [puppet] - 10https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) [12:37:54] (03PS2) 10Jbond: puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 [12:38:36] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 (owner: 10Jbond) [12:39:21] (03CR) 10Marostegui: [C: 03+2] dbbackups: Migrate s8 backups db2100 -> db2098 [puppet] - 10https://gerrit.wikimedia.org/r/731404 (https://phabricator.wikimedia.org/T290868) (owner: 10Marostegui) [12:39:29] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [12:40:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/731220 (owner: 10Majavah) [12:43:09] (03PS3) 10Jbond: puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 [12:43:40] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 (owner: 10Jbond) [12:46:22] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Epic, 10HTTPS: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) [12:59:13] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:01:07] (03PS7) 10Vgutierrez: haproxy: Allow configuring TLS options [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) [13:01:16] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." 
[homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [13:02:52] (03PS1) 10Filippo Giunchedi: statsd: failover writes to graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/731433 (https://phabricator.wikimedia.org/T247963) [13:02:54] (03PS1) 10Filippo Giunchedi: monitoring: check graphite2003 metrics [puppet] - 10https://gerrit.wikimedia.org/r/731434 (https://phabricator.wikimedia.org/T247963) [13:03:56] (03PS1) 10Filippo Giunchedi: discovery: move read traffic to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/731435 (https://phabricator.wikimedia.org/T247963) [13:03:58] (03PS1) 10Filippo Giunchedi: wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/731436 (https://phabricator.wikimedia.org/T247963) [13:04:45] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Allow configuring TLS options [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:04:50] (03CR) 10jerkins-bot: [V: 04-1] wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/731436 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:13:31] (03PS4) 10Jbond: puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 [13:14:04] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 (owner: 10Jbond) [13:14:25] (03CR) 10Vgutierrez: [C: 03+2] haproxy: STEK support [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:14:40] (03PS9) 10Vgutierrez: haproxy: STEK support [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) [13:18:59] PROBLEM - Check systemd state on dns5002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:44] (03PS1) 10Michael DiPietro: Revert "wikireplicas: depool clouddb1020" [puppet] - 10https://gerrit.wikimedia.org/r/731416 [13:22:45] RECOVERY - cassandra-b CQL 10.64.32.147:9042 on aqs1013 is OK: TCP OK - 0.000 second response time on 10.64.32.147 port 9042 https://phabricator.wikimedia.org/T93886 [13:26:50] (03PS4) 10Vgutierrez: cache::haproxy: Configure sslcert::ocsp [puppet] - 10https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005) [13:27:34] jouncebot: nowandnext [13:27:35] No deployments scheduled for the next 2 hour(s) and 2 minute(s) [13:27:35] In 2 hour(s) and 2 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1530) [13:27:53] I’ll continue with my config cleanups (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731014 and next change) if that’s okay with everyone [13:28:43] (03CR) 10Michael DiPietro: [C: 03+2] Revert "wikireplicas: depool clouddb1020" [puppet] - 10https://gerrit.wikimedia.org/r/731416 (owner: 10Michael DiPietro) [13:30:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Unconditionally enable Wikibase dispatching via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731014 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [13:30:19] (03PS2) 10Btullis: Add initial personal dotfiles and one script [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) [13:30:52] (03Merged) 10jenkins-bot: Unconditionally enable Wikibase dispatching via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731014 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [13:31:30] testing on mwdebug1001 [13:32:02] (03CR) 10Btullis: Add initial personal dotfiles and one script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) 
(owner: 10Btullis) [13:32:16] (03PS3) 10Btullis: Remove alluxio resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) [13:32:29] (03CR) 10Jbond: [C: 03+1] Add initial personal dotfiles and one script [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [13:33:00] (03CR) 10Btullis: [C: 03+2] Add initial personal dotfiles and one script [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [13:33:30] (03CR) 10Btullis: [C: 03+2] Remove alluxio resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:34:10] seems fine, syncing [13:34:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:731014|Unconditionally enable Wikibase dispatching via jobs (T291828)]] (duration: 00m 56s) [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [13:35:50] alright, I’ll watch out for a bit before deploying the other config change [13:37:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:46] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar): Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10jijiki) [13:38:50] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Configure sslcert::ocsp [puppet] - 10https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:41:34] (03PS4) 10Vgutierrez: haproxy: Allow configuring timeouts [puppet] - 10https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005) [13:43:25] everything looks fine to me so I’ll do the other config change too [13:43:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove wmg variables for dispatch via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731015 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [13:44:58] (03Merged) 10jenkins-bot: Remove wmg variables for dispatch via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731015 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [13:47:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:731015|Remove wmg variables for dispatch via jobs (T291828)]] (1/2) (duration: 00m 56s) [13:47:20] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) main-eqiad is done, I've done this extra moves for the `statsv` topics (in both clusters): ` { "partitions": [ {"topic": "statsv", "partition": 0, "replicas": [2005,2002... 
[13:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:23] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [13:47:55] <_joe_> jouncebot: noxt [13:48:01] <_joe_> meh [13:48:04] <_joe_> jouncebot: next [13:48:04] In 1 hour(s) and 41 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1530) [13:48:12] I’m currently syncing [13:48:16] but almost done [13:48:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:731015|Remove wmg variables for dispatch via jobs (T291828)]] (2/2) (duration: 00m 56s) [13:48:27] tada [13:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10lbowmaker) @CDanis - the SSH part worked but I was having trouble accessing some sites. superset.wikimedia.org redirects to: https://idp.wikimedia.org/login?service=http... [13:51:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 (owner: 10Giuseppe Lavagetto) [13:55:06] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/31746/" [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [13:55:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:56:05] (03PS2) 10Filippo Giunchedi: wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/731436 (https://phabricator.wikimedia.org/T247963) [13:57:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:58:04] (03CR) 10Herron: [C: 03+2] centrallog2002: apply role::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [13:58:09] PROBLEM - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:05] (03Merged) 10jenkins-bot: mediawiki: improve rsyslog handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/731354 (owner: 10Giuseppe Lavagetto) [14:03:24] (03CR) 10Ottomata: [C: 03+1] "Makes sense, not sure why it is the way it is now." 
[puppet] - 10https://gerrit.wikimedia.org/r/731332 (owner: 10Elukey) [14:03:41] (03CR) 10Elukey: [C: 03+2] profile::prometheus::alerts: update kafka mirror maker settings [puppet] - 10https://gerrit.wikimedia.org/r/731332 (owner: 10Elukey) [14:04:51] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:05:39] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) The topic moves have been completed for kafka main-eqiad, here's a list of timings of when the rebalance kicked off for each topic: ` Oct 18 13:30 statsv.json Oct 18 13:08 cod... [14:06:53] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:07:38] (03PS5) 10Vgutierrez: haproxy: Allow configuring timeouts [puppet] - 10https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005) [14:09:13] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [14:09:28] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Allow configuring timeouts [puppet] - 10https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:09:45] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:52] Lucas_WMDE: All good? 
I ask because I've got a -labs.php only change that I'd like to merge and sync [14:11:02] phuzion: I’m all done, go ahead as far as I’m concerned [14:11:16] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) [14:11:26] (03CR) 10Muehlenhoff: [C: 03+2] Add more role contacts [puppet] - 10https://gerrit.wikimedia.org/r/731093 (owner: 10Muehlenhoff) [14:11:32] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) 05Open→03Resolved a:03elukey Both clusters done, there is still some unbalance between the workers but for now it seems good enough for our use cases. It is a good trade-... [14:11:40] 10SRE, 10Analytics-Radar, 10Event-Platform, 10Platform Team Initiatives (Modern Event Platform (TEC2)), 10User-herron: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10elukey) [14:12:04] sorry, that was supposed to be a ping to phuedx [14:12:09] should’ve checked what it autocompleted to [14:12:29] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) 05Open→03Resolved [14:13:06] 10SRE: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10elukey) @herron this seems to be the last task for the kafka main transitioning that we opened a while ago. Anything worth doing? Otherwise we can decline and close i... [14:15:17] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:28] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Volans) Downtimed `ps1-d1-codfw` until 2021-11-08 14:13:13 UTC on Icinga [14:15:32] 10SRE, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [14:15:47] 10SRE: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10herron) 05Stalled→03Declined Agreed! [14:17:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10lbowmaker) Also, this doesn't work either (same issue) - this should work on LDAP wmf membership https://grafana-rw.wikimedia.org/?orgId=1 [14:21:58] (03CR) 10Filippo Giunchedi: [C: 04-1] "While this would technically work I think we should pursue an higher level "routing table" to direct producers to a given kafka cluster. F" [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [14:23:16] 10SRE, 10SRE-swift-storage, 10ops-codfw: moss-be2002.mgmt alert for DNS - https://phabricator.wikimedia.org/T293610 (10Papaul) 05Open→03Resolved a:03Papaul It turns out be a back cable. I replace the cable , the system is back up. [14:25:18] (03PS2) 10Phuedx: [beta] Rename $wgIPInfoGeoIP2Path to $wgIPInfoGeoIP2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731346 (https://phabricator.wikimedia.org/T289361) [14:25:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10CDanis) 05Resolved→03Open Interesting, I haven't encountered this before. 
You are in the wmf LDAP group: https://ldap.toolforge.org/user/lbowmaker @MoritzMuehlenhoff... [14:25:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10lbowmaker) Please ignore the above, after logging into CAS, then out, then back in to CAS the links to superset, turnilo, etc worked. [14:25:47] ^^ That's the change I'm going to be syncing momentarily [14:26:37] (03CR) 10TsepoThoabala: [C: 03+1] [beta] Rename $wgIPInfoGeoIP2Path to $wgIPInfoGeoIP2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731346 (https://phabricator.wikimedia.org/T289361) (owner: 10Phuedx) [14:26:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10CDanis) 05Open→03Resolved Weird, but happy to hear it :) [14:29:40] (03CR) 10Phuedx: [C: 03+2] [beta] Rename $wgIPInfoGeoIP2Path to $wgIPInfoGeoIP2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731346 (https://phabricator.wikimedia.org/T289361) (owner: 10Phuedx) [14:30:21] (03Merged) 10jenkins-bot: [beta] Rename $wgIPInfoGeoIP2Path to $wgIPInfoGeoIP2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731346 (https://phabricator.wikimedia.org/T289361) (owner: 10Phuedx) [14:31:01] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/730210 (https://phabricator.wikimedia.org/T236208) (owner: 10BBlack) [14:32:05] 10SRE, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10elukey) 05Open→03Resolved a:03elukey [14:33:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:48] Syncing that change ^^ [14:36:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:36] !log phuedx@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:731346|[beta] Rename $wgIPInfoGeoIP2Path to $wgIPInfoGeoIP2Prefix (T289361)]] (duration: 00m 56s) [14:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:42] T289361: Rename $wgIPInfoGeoIP2Path or hard code the GeoLite2/GeoIP2- prefix [S] - https://phabricator.wikimedia.org/T289361 [14:40:01] (03PS5) 10Jbond: puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 [14:40:47] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 (owner: 10Jbond) [14:41:53] (03PS6) 10Jbond: puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 [14:43:13] (03PS4) 10Vgutierrez: haproxy: Add H2 performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/719974 (https://phabricator.wikimedia.org/T290005) [14:43:49] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:16] PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:02] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:45:42] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Add H2 performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/719974 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:45:47] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:15] (03CR) 10Jbond: [C: 03+2] puppetboard: introduce puppetboard module [puppet] - 10https://gerrit.wikimedia.org/r/731427 (owner: 10Jbond) [14:47:56] RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:42] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:14] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:14] PROBLEM - Check systemd state on dns1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:14] (03PS3) 10Vgutierrez: haproxy: Add PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/720021 (https://phabricator.wikimedia.org/T290005) [14:49:56] PROBLEM - IPMI Sensor Status on db2088 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:50:12] PROBLEM - IPMI Sensor Status on elastic2034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:50:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:45] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Add PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/720021 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:54:07] (03PS4) 10Vgutierrez: haproxy: Allow adding/removing HTTP headers [puppet] - 10https://gerrit.wikimedia.org/r/720272 (https://phabricator.wikimedia.org/T290005) [14:54:19] !log rebuilt and uploaded kafkatee for bullseye T292196 [14:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:25] T292196: Put centrallog2002 in service - https://phabricator.wikimedia.org/T292196 [14:55:36] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Allow adding/removing HTTP headers [puppet] - 10https://gerrit.wikimedia.org/r/720272 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:55:58] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:56:10] PROBLEM - IPMI Sensor Status on db2151 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:56:23] papaul: is codfw power you? 
[14:57:08] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:57:53] (03PS4) 10Vgutierrez: haproxy: Allow loading lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/720273 (https://phabricator.wikimedia.org/T290005) [14:59:33] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Allow loading lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/720273 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:59:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 11 hosts with reason: Schema change s7 T281058 [14:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:43] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [14:59:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 11 hosts with reason: Schema change s7 T281058 [14:59:46] PROBLEM - IPMI Sensor Status on ganeti2025 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] (03CR) 10Cwhite: [C: 03+1] kafka_shipper: point codfw hosts to kafka-logging-codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [15:00:34] (03CR) 10Hnowlan: [C: 04-1] api-gateway: allow HTTP host header rewrite for discovery endpoints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [15:02:22] (03PS8) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 
10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [15:03:41] (03PS1) 10Muehlenhoff: Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/731767 [15:04:43] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Kaganer) @Bawolff, As you may know, our colleague @Krassotkin was recently banned by the... [15:04:53] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:05:04] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Aklapper) As the Phab account is bound to a [WMDE SUL account](https://meta.wikimedia.org/wiki/Special:CentralAuth?target=Tonina%20Z... [15:06:04] PROBLEM - IPMI Sensor Status on db2139 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:06:50] PROBLEM - IPMI Sensor Status on restbase2017 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:07:04] papaul: related to your psu swap I guess? 
^^^ [15:07:06] PROBLEM - IPMI Sensor Status on es2033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:07:38] apparently we don't have those checks in Icinga as child of the related PSU [15:07:57] volans: yes [15:08:07] getting all back online now [15:08:57] (03CR) 10Muehlenhoff: [C: 03+2] Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/731767 (owner: 10Muehlenhoff) [15:09:32] vgutierrez: shall I puppet-merge your cache::haproxy patch along? [15:09:43] moritzm: go ahead please <3 [15:10:20] ack, doing that now [15:11:15] (03CR) 10Michael Große: [C: 03+1] mediawiki: Absent wikibase_repo_prune2 systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726746 (https://phabricator.wikimedia.org/T292604) (owner: 10Ladsgroup) [15:11:51] (03CR) 10Michael Große: [C: 03+1] "I think this is ready to be merged after I785714c00a517c024e2023762d6beb282f5e2e2c" [puppet] - 10https://gerrit.wikimedia.org/r/731028 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [15:12:33] volans: PDU in place all the devices in the rack are back online [15:12:56] ack thanks! 
[15:16:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 13 hosts with reason: Schema change s4 T281058 [15:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:12] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [15:16:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 13 hosts with reason: Schema change s4 T281058 [15:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:49] !log reprepro copied anycast-healthchecker, python3-json-logger and python3-anycast-healthchecker from buster-wikimedia to bullseye-wikimedia T292196 [15:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:55] T292196: Put centrallog2002 in service - https://phabricator.wikimedia.org/T292196 [15:20:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10jbond) Just a note to say that CAS/IDP (and also mod_auth_cas) only resolve attributes including the `memberOf` attribute at session creation (when you login). so if you... 
[15:20:54] RECOVERY - IPMI Sensor Status on db2088 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:21:10] RECOVERY - IPMI Sensor Status on elastic2034 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:23:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: Schema change s3 T281058 [15:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:26] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [15:23:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: Schema change s3 T281058 [15:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:55] (03PS1) 10ArielGlenn: index page and directory for Wikimedia Enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/731768 (https://phabricator.wikimedia.org/T273585) [15:24:06] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Joe) > **I ask you to explicitly confirm or refute this conclusion.** > Your conclusion... [15:24:59] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Cmjohnson) @Bstorm @Andrew @Dzahn Replaced the disk and it's currently rebuilding cmjohnson@labweb1002:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [rai... 
[15:27:10] RECOVERY - IPMI Sensor Status on db2151 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:29:30] (03PS3) 10Jbond: mediawiki: add get_primary_dc function [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 [15:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1530). [15:30:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) [15:30:46] RECOVERY - IPMI Sensor Status on ganeti2025 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:32:01] !log mvernon@cumin2002 START - Cookbook sre.discovery.service-route [15:32:01] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [15:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:47] (03PS1) 10Jbond: puppetboard: also add facts-contents [puppet] - 10https://gerrit.wikimedia.org/r/731771 [15:33:29] (03CR) 10Jbond: [C: 03+2] puppetboard: also add facts-contents [puppet] - 10https://gerrit.wikimedia.org/r/731771 (owner: 10Jbond) [15:35:34] (03PS1) 10Jbond: puppetboard: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/731772 [15:35:50] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: add get_primary_dc function [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [15:36:35] (03CR) 10Jbond: [C: 03+2] puppetboard: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/731772 (owner: 10Jbond) [15:37:06] RECOVERY - IPMI Sensor Status 
on db2139 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:37:52] RECOVERY - IPMI Sensor Status on restbase2017 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:38:10] RECOVERY - IPMI Sensor Status on es2033 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:39:08] (03PS1) 10Jbond: puppetboard: use correct file path settings.py [puppet] - 10https://gerrit.wikimedia.org/r/731773 [15:39:46] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: use correct file path settings.py [puppet] - 10https://gerrit.wikimedia.org/r/731773 (owner: 10Jbond) [15:40:18] 10SRE, 10Acme-chief, 10Traffic, 10Patch-For-Review: Implement a watchdog mechanism on acme-chief - https://phabricator.wikimedia.org/T292619 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [15:40:51] (03PS2) 10Jbond: puppetboard: use correct file path settings.py [puppet] - 10https://gerrit.wikimedia.org/r/731773 [15:42:21] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/731774 [15:43:52] (03CR) 10Jbond: [C: 03+2] puppetboard: use correct file path settings.py [puppet] - 10https://gerrit.wikimedia.org/r/731773 (owner: 10Jbond) [15:44:00] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar): Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Izno) > It... [15:47:38] RECOVERY - DNS on moss-be2002.mgmt is OK: DNS OK: 0.035 seconds response time. 
moss-be2002.mgmt.codfw.wmnet returns 10.193.0.151 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:50:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] mediawiki: Absent wikibase_repo_prune2 systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/726746 (https://phabricator.wikimedia.org/T292604) (owner: 10Ladsgroup) [15:51:49] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/731774 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [15:52:19] (03PS2) 10Lucas Werkmeister (WMDE): mediawiki: Drop absented wikibase_repo_prune2 systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/731028 (https://phabricator.wikimedia.org/T292604) [15:52:34] (03Abandoned) 10Lucas Werkmeister (WMDE): mediawiki: Absent wikidatawiki change pruning [puppet] - 10https://gerrit.wikimedia.org/r/731027 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [15:54:49] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/31749/" [puppet] - 10https://gerrit.wikimedia.org/r/731774 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [15:55:27] (03PS4) 10Herron: kafka_shipper: map site -> brokers centrally & point codfw to site local brokers [puppet] - 10https://gerrit.wikimedia.org/r/731774 (https://phabricator.wikimedia.org/T293439) [15:56:06] (03PS1) 10Muehlenhoff: Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/731777 [15:57:53] (03CR) 10Muehlenhoff: [C: 03+2] Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/731777 (owner: 10Muehlenhoff) [16:05:20] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10matmarex) Bawolff doesn't even work for the WMF, and hasn't for two years now, please do... 
[16:08:50] RECOVERY - Device not healthy -SMART- on labweb1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labweb1002&var-datasource=eqiad+prometheus/ops [16:15:22] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:24] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [16:25:10] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10patilise) This is not a place to discuss office actions. Please only leave comments relat... [16:32:33] (03CR) 10David Caro: Buster tracking updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731777 (owner: 10Muehlenhoff) [16:33:40] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:54] (03PS1) 10David Caro: site: cloudvirt1020 back to just virt [puppet] - 10https://gerrit.wikimedia.org/r/731781 [16:34:22] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:50] (03CR) 10David Caro: [C: 03+2] site: cloudvirt1020 back to just virt [puppet] - 10https://gerrit.wikimedia.org/r/731781 (owner: 10David Caro) [16:38:20] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) Note: The new Eaton PDU 
has only SNMP V1 and V3 [16:42:33] (03CR) 10Muehlenhoff: "Sorry for the breakage, that was entirely unintentional!" [puppet] - 10https://gerrit.wikimedia.org/r/731781 (owner: 10David Caro) [16:49:02] 10SRE, 10Analytics, 10Analytics-Kanban, 10wmfdata-python, 10Product-Analytics (Kanban): wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10odimitrijevic) p:05Triage→03High [16:52:02] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:42] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:56:57] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Dzahn) Thanks @Cmjohnson ! @Andrew fyi, this happened. I raised it because it seemed relatively urgent since it's one of 2 labweb backends and I was just triaging tickets on clini... [16:57:34] PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:58] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1700). 
[17:06:08] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:06:12] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Kaganer) >>! In T287380#7437287, @patilise wrote: > This is not a place to discuss office... [17:07:46] RECOVERY - Check systemd state on an-worker1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:28] (03PS1) 10Jbond: P:puppetboard:ng: add new profile for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/731783 [17:11:07] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Kaganer) >>! In T287380#7437205, @matmarex wrote: > Bawolff doesn't even work for the WMF... [17:11:37] (03PS4) 10Jbond: mediawiki: add get_primary_dc function [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 [17:11:54] (03CR) 10Jbond: "thanks updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [17:23:56] mutante: like this? 
[17:24:02] urbanecm: yes, thanks :) [17:24:06] any time [17:24:10] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:24:39] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Aklapper) @Kaganer: The [etiquette](https://www.mediawiki.org/wiki/Bug_management/Phabric... [17:28:30] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) [17:28:51] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) removed from "nda" and "wmde" groups in LDAP [17:36:11] (03PS1) 10Legoktm: tests: MWHttpRequestTest is a unit test, not an integration test [core] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731758 [17:37:05] (03PS1) 10Dzahn: admin: disable shell access for tonina [puppet] - 10https://gerrit.wikimedia.org/r/731787 (https://phabricator.wikimedia.org/T293621) [17:37:41] (03CR) 10Legoktm: "This change is ready for review." 
[core] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731757 (https://phabricator.wikimedia.org/T288848) (owner: 10Legoktm) [17:38:09] (03CR) 10jerkins-bot: [V: 04-1] admin: disable shell access for tonina [puppet] - 10https://gerrit.wikimedia.org/r/731787 (https://phabricator.wikimedia.org/T293621) (owner: 10Dzahn) [17:41:28] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10matmarex) You may rest assured that I will ignore your comments in the future. Good luck... [17:43:33] (03PS2) 10Dzahn: admin: disable shell access for tonina [puppet] - 10https://gerrit.wikimedia.org/r/731787 (https://phabricator.wikimedia.org/T293621) [17:43:47] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (10Legoktm) Maybe I misread events_coredb_slave.sql, but it looked to me like it was intended to be cleared every 24h? Th... [17:46:01] 10SRE, 10Data-Persistence (Consultation), 10Performance-Team, 10Wikimedia-Rdbms, 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Legoktm) [17:48:00] 10SRE, 10Data-Persistence (Consultation), 10Performance-Team, 10Wikimedia-Rdbms, 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Legoktm) I recall this is already implemented in TransactionProfiler... 
[17:48:15] (03CR) 10Dzahn: [C: 03+2] admin: disable shell access for tonina [puppet] - 10https://gerrit.wikimedia.org/r/731787 (https://phabricator.wikimedia.org/T293621) (owner: 10Dzahn) [17:51:42] !log puppet run on all bastion hosts via cumin [17:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:52] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Kaganer) @Aklapper , This distracts from the topic. By way, I'm not a fresh user in Movem... [18:00:05] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1800). [18:00:05] No Gerrit patches in the queue for this window AFAICS. [18:00:07] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10TechConductCommittee) Hello everyone, We would like to remind all discussion participant... [18:01:45] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) @WMDE-leszek ,re: >> I'd appreciate if WMF staff audited that they no longer have any staff-related ac... [18:02:30] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Aklapper) >>! In T287380#7437640, @Kaganer wrote: > And when was the previous one? See... 
[18:03:07] 10SRE, 10Data-Persistence (Consultation), 10Performance-Team, 10Wikimedia-Rdbms, 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Legoktm) Right, searching for `channel:DBPerformance AND message:"readQueryTime"` gives slow queries:... [18:04:24] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Urbanecm) @WMDE-leszek Hey, I see https://meta.wikimedia.org/wiki/Special:CentralAuth?target=Tonina%20Zhelyazk... [18:05:04] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:56] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10MoritzMuehlenhoff) Created https://phabricator.wikimedia.org/T293676 for the Data Engineering team to review d... [18:06:29] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) @WMDE-leszek Going through your list: LDAP groups removal: done, analytics-privatedata-users removal: d... [18:06:40] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Kaganer) >>! In T287380#7437660, @Aklapper wrote: >>>! In T287380#7437640, @Kaganer wrote... 
[18:07:30] !log gerrit - removed tonina from wmde-mediawiki gerrit group (T293621) [18:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:36] T293621: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 [18:08:58] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) [18:09:50] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) a:03Dzahn [18:10:58] 10SRE, 10Wikimedia-Mailing-lists: Disable "Unblock-pt-l" - https://phabricator.wikimedia.org/T293591 (10Dzahn) Completely disabling and redirecting mail are mutually exclusive I think. [18:12:13] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) [18:14:30] 10SRE, 10SRE-Access-Requests, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Tonina Zhelyazkova from WMF systems - https://phabricator.wikimedia.org/T293621 (10Dzahn) 05Open→03Resolved p:05Triage→03High [18:21:05] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Urbanecm) >>! In T287380#7437680, @Kaganer wrote: >>>! In T287380#7437660, @Aklapper wrot... [18:21:43] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Legoktm) >>! In T293532#7435235, @Marostegui wrote: > I wouldn't like to have an external process running all the time "just in case" The external process was jus... 
[18:21:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10Dzahn) 05Open→03Stalled [18:23:01] (03PS1) 10Andrew Bogott: nova vendordata: Add a second puppet agent --enable [puppet] - 10https://gerrit.wikimedia.org/r/731790 [18:25:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10Dzahn) a:03NRodriguez Hi Natalia, I'm assigning this back to you because we need a key from you to move forward on this and a different person handles access request t... [18:25:47] (03PS2) 10Ottomata: Convert $wgEventStreams to be an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway) [18:26:13] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: Add a second puppet agent --enable [puppet] - 10https://gerrit.wikimedia.org/r/731790 (owner: 10Andrew Bogott) [18:26:45] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Epic, 10HTTPS: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Dzahn) p:05Triage→03High [18:28:17] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Legoktm) >>! In T287380#7437640, @Kaganer wrote: > If these questions have already been a... 
[18:29:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Dzahn) [18:29:43] urbanecm: i'm going to merge and deploy a config change, i don't see anything happenign in this backport window [18:30:39] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Kormat) >>! In T293532#7437753, @Legoktm wrote: > So what needs to be done to get the query killer to a state that it functions under high load? Is that simply inf... [18:31:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Dzahn) a:03DAbad - confirmed L3 signature exists Hi @DAbad assigning this back to you because we need a key from you to proceed and access request tickets are handled by a... [18:35:46] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Andrew) Thank you @Dzahn and @Cmjohnson [18:36:55] (03CR) 10Ottomata: [C: 03+2] Convert $wgEventStreams to be an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway) [18:38:56] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Epic, 10HTTPS: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Dzahn) p:05High→03Medium [18:40:55] (03CR) 10Ottomata: "Did a diff of before and after in prod, and got same results." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway) [18:42:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Dzahn) [18:44:45] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10aaron) [18:45:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:32] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Convert $wgEventStreams to be an associative array - T277193 (duration: 00m 57s) [18:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:37] T277193: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 [18:51:16] @urbanecm I got my times messed up and thought the window was in 10 minutes not 50 minutes ago :P [18:51:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10Dzahn) a:03DAbad Hello @DAbad here is another ticket for you. This is about approval for access for echetty. Do you approve? Feel free to assign the ticket back to me or... [18:51:41] I'm guessing I should wait till later? 
[18:51:55] jouncebot: nowandnext [18:51:56] For the next 0 hour(s) and 8 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T1800) [18:51:56] In 1 hour(s) and 8 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T2000) [18:52:11] Seddon: we still have eight minutes [18:52:22] What do you want to get deployed? [18:52:27] urbanecm: there's nothing after too [18:52:53] Yeah [18:53:01] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/731785 [18:53:31] (03PS1) 10Hashar: gitlab: fix connect-src CSP [puppet] - 10https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) [18:53:43] (03PS1) 10Andrew Bogott: Add taavi/majavah root key [labs/private] - 10https://gerrit.wikimedia.org/r/731796 (https://phabricator.wikimedia.org/T292827) [18:54:09] (03CR) 10jerkins-bot: [V: 04-1] gitlab: fix connect-src CSP [puppet] - 10https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [18:54:28] (03PS1) 10Ottomata: Test wgEventStream config merging in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731797 (https://phabricator.wikimedia.org/T277193) [18:54:30] (03PS2) 10Hashar: gitlab: fix connect-src CSP [puppet] - 10https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) [18:54:57] (03PS1) 10Urbanecm: Revert 727328 [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731808 (https://phabricator.wikimedia.org/T293554) [18:55:02] (03CR) 10Urbanecm: [C: 03+2] Revert 727328 [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731808 (https://phabricator.wikimedia.org/T293554) (owner: 10Urbanecm) [18:55:15] Let's wait for CI then Seddon [18:55:22] Also +2'ed in master, too [18:55:30] @urbanecm could thanks! 
[18:55:33] cool* [18:55:37] You are the best [18:55:48] (03CR) 10jerkins-bot: [V: 04-1] Test wgEventStream config merging in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731797 (https://phabricator.wikimedia.org/T277193) (owner: 10Ottomata) [18:56:01] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add taavi/majavah root key [labs/private] - 10https://gerrit.wikimedia.org/r/731796 (https://phabricator.wikimedia.org/T292827) (owner: 10Andrew Bogott) [18:56:41] (03PS1) 10Hashar: gitlab: enable report only CSP on primary [puppet] - 10https://gerrit.wikimedia.org/r/731798 (https://phabricator.wikimedia.org/T285363) [18:57:01] (03PS2) 10Ottomata: Test wgEventStream config merging in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731797 (https://phabricator.wikimedia.org/T277193) [18:58:04] (03CR) 10Hashar: "I am not sure why I haven't caught that one earlier. I have confirmed that the space separated sources are properly parsed by https://csp" [puppet] - 10https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [18:58:21] * Spookreeeno agrees urbanecm is the best [18:58:58] Thank you :) [18:59:25] (03CR) 10Hashar: "After parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/731795 , the CSP directives should pass https://csp-evaluator.wit" [puppet] - 10https://gerrit.wikimedia.org/r/731798 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [19:03:52] (03CR) 10Andrew Bogott: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [19:04:43] * urbanecm hypnotizes Jenkins to merge [19:05:28] (03PS1) 10Ottomata: Recursively merge beta settings with production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) [19:05:47] (03CR) 10Ottomata: "NOTE: Can't use default for wgEventStreams unless https://phabricator.wikimedia.org/T132274 is fixed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731797 (https://phabricator.wikimedia.org/T277193) (owner: 10Ottomata) [19:06:02] (03CR) 10Ottomata: [C: 03+2] Test wgEventStream config merging in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731797 (https://phabricator.wikimedia.org/T277193) (owner: 10Ottomata) [19:06:37] (03CR) 10jerkins-bot: [V: 04-1] Recursively merge beta settings with production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) (owner: 10Ottomata) [19:09:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:38] (03CR) 10Volans: [C: 03+1] "LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [19:10:37] Seddon: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-selenium-docker/82628/console failed :/ [19:12:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:25] (03CR) 10jerkins-bot: [V: 04-1] Revert 727328 [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731808 (https://phabricator.wikimedia.org/T293554) (owner: 10Urbanecm) [19:12:39] and apparently backport CI failed too [19:12:57] 20:00:07 npm ERR! request to https://registry.npmjs.org/pkg-dir/-/pkg-dir-3.0.0.tgz failed, reason: Socket timeout [19:13:07] yeah [19:13:30] still running for gate-and-submit [19:13:31] (03CR) 10Andrew Bogott: [C: 03+1] "Just talked to cwhite about this." [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [19:13:41] removing the -1, let's see what the other job says [19:14:59] https://status.npmjs.org/ seems fine [19:15:43] might be an issue in our infra [19:15:49] or just a temporary failure [19:17:33] (03CR) 10Herron: [C: 03+1] statsd: failover writes to graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/731433 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [19:17:53] (03CR) 10Herron: [C: 03+1] monitoring: check graphite2003 metrics [puppet] - 10https://gerrit.wikimedia.org/r/731434 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [19:18:38] (03CR) 10Herron: [C: 03+1] discovery: move read traffic to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/731435 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [19:20:51] (03CR) 10Herron: [C: 03+1] wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/731436 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [19:22:51] (03Merged) 10jenkins-bot: Revert 727328 [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/731808 (https://phabricator.wikimedia.org/T293554) (owner: 10Urbanecm) [19:22:56] finally [19:23:48] ottomata: i see you have an undeployed patch at 
deployment -- should i wait? [19:23:53] or can you fix that? [19:24:33] (03PS1) 10Ottomata: DO NOT MERGE - Test wgEventStreams overrides for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731804 (https://phabricator.wikimedia.org/T277193) [19:24:35] can fix, its just beta [19:24:44] thanks [19:24:58] rebased [19:25:08] go ahead urbanecm [19:25:10] thanks [19:25:31] Seddon: available at mwdebug1001 for you, can you test? [19:25:40] Yep will test now [19:25:56] thanks [19:26:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:07] urbanecm: we are good! [19:28:11] syncing! [19:29:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:21] Seddon: btw, for future reverts, gerrit has a nice button called "revert". It's usually beter, as it provides more information than a manually-written commit message. [19:29:34] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/MediaSearch/resources/store/state.js: ac7b4fc2ccc69589e00a42f49d18a8f6d71777f2: Revert 727328 (T293554) (duration: 00m 56s) [19:29:39] and, deployed [19:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:40] T293554: Switching tabs in MediaSearch does not re-query the search - https://phabricator.wikimedia.org/T293554 [19:29:48] @urbanecm I haven't pressed it before :P Next time I will. [19:29:58] !log LDAP: removed non-existent user gerrit2 from group labsadminbots (T160122) [19:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:03] T160122: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122 [19:30:09] Seddon: appreciated! feel free to ask for help if you need any! 
[19:31:18] * urbanecm done with deployment [19:39:02] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Marostegui) >>! In T293532#7437753, @Legoktm wrote: >>>! In T293532#7435235, @Marostegui wrote: >> I wouldn't like to have an external process running all the time... [19:40:28] (03PS1) 10Bartosz Dziewoński: Enable topic subscriptions as a beta feature on all remaining projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731805 (https://phabricator.wikimedia.org/T287802) [19:40:39] (03CR) 10Andrew Bogott: "This (probably) broke the catalog on cloud-wide ntp servers" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [19:42:16] (03PS1) 10Bernard Wang: Bump DesktopWebUIActionsTracking sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731827 [19:42:26] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:42:27] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Dzahn) a:03Dzahn [19:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:16] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Dzahn) Yes, the proposal makes sense and I can confirm everything Andre listed in such detail above. Using the 2 se... 
[19:46:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:28] (03PS1) 10Herron: add centrallog2002 to codfw anycast_neighbors and syslog fw allows [homer/public] - 10https://gerrit.wikimedia.org/r/731828 (https://phabricator.wikimedia.org/T292196) [19:51:13] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:56:43] (03CR) 10Bernard Wang: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731827 (https://phabricator.wikimedia.org/T292588) (owner: 10Bernard Wang) [20:00:05] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T2000). [20:05:50] 10SRE, 10Wikimedia-Mailing-lists: Disable "Unblock-pt-l" - https://phabricator.wikimedia.org/T293591 (10Legoktm) I'm not aware of any other current lists that now forward to a VRTS queue. I think there are two basic options: 1. Disable the Mailman list and set an autoresponder so anyone who tries to email it i... 
[20:07:01] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:13:34] (03PS2) 10Ottomata: DO NOT MERGE - Test wgEventStreams overrides for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731804 (https://phabricator.wikimedia.org/T277193) [20:16:05] PROBLEM - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:21] (03PS3) 10Ottomata: DO NOT MERGE - Test wgEventStreams overrides for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731804 (https://phabricator.wikimedia.org/T277193) [20:17:43] (03CR) 10Dzahn: "guys, what you are asking for is PS1 which jerkins hates because then:" [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [20:21:44] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @Papaul Could we schedule a firmware upgrade for gerrit2001 due to this issue? (not high prio) [20:23:47] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Papaul) @Dzahn sure we can. 
[20:26:45] (03PS2) 10Ottomata: Recursively merge beta settings with production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) [20:27:20] (03CR) 10jerkins-bot: [V: 04-1] Recursively merge beta settings with production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) (owner: 10Ottomata) [20:29:10] (03PS4) 10Ottomata: Comment changes for wgEventStreams about config merging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731804 (https://phabricator.wikimedia.org/T277193) [20:30:22] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @cmooney Thank you very much for all the debugging effort you put into this and thanks @Papaul for confirming... [20:31:14] ACKNOWLEDGEMENT - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:14] ACKNOWLEDGEMENT - SSH on mw2253.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:14] ACKNOWLEDGEMENT - SSH on puppetmaster1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:14] ACKNOWLEDGEMENT - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:19] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 
10wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nshahquinn-wmf) I think I can handle this just by using an absolute reference to the [file in refinery](https://github.co... [20:32:23] (03CR) 10Ottomata: [C: 03+2] "Also revert previous labs test change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731804 (https://phabricator.wikimedia.org/T277193) (owner: 10Ottomata) [20:32:52] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nshahquinn-wmf) p:05High→03Medium It seems like the priority isn't //that// high since there's a pretty easy workarou... [20:34:54] (03PS3) 10Ottomata: Recursively merge beta settings with production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) [20:38:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:07] (03PS1) 10Ottomata: wgEventStreams - remove redundant stream setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731833 (https://phabricator.wikimedia.org/T277193) [20:41:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:59] ACKNOWLEDGEMENT - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service daniel_zahn reported to search, not getting traffic https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:50] (03PS4) 10Ottomata: Recursively merge beta settings with production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) [20:48:37] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Papaul) @Dzahn I will go for turning this into a tracking ticket for firmware upgrades with check boxes of affected... [20:49:59] (03CR) 10Ottomata: "Huh, this kind of works, but all the configs in InitialiseSettings-labs.php that use 'default' will probably need to have '-' prefixed so " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) (owner: 10Ottomata) [20:50:39] (03CR) 10Ottomata: "This is more of an idea than something I expect to get through, whatchya think?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) (owner: 10Ottomata) [20:53:13] PROBLEM - Check systemd state on dns3001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:06] !log mwmaint1002 - attempt to start mediawiki_job_translationnotifications-mediawikiwiki which was alerting as failed [20:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Reedy and sbassett: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. 
Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T2100). [21:05:31] !log mwmaint1002 - sudo -u www-data /usr/local/bin/mw-cli-wrapper /usr/local/bin/mwscript extensions/TranslationNotifications/scripts/DigestEmailer.php --wiki mediawikiwiki | Fatal error: Uncaught Error: Class 'MediaWiki\MediaWikiServices' not found [21:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:10] mutante: lol [21:09:20] Want to file a bug for that? [21:09:23] https://github.com/wikimedia/mediawiki-extensions-TranslationNotifications/commit/813600deaa42e7208d5ef6fb957e5711310b69ea [21:09:39] Reedy: yes, writing it [21:11:18] https://phabricator.wikimedia.org/T293702 [21:11:46] (03PS1) 10Hashar: systemd::timer: spec coverage for splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/731838 [21:11:48] (03PS1) 10Hashar: systemd::timer::job: add support for splay [puppet] - 10https://gerrit.wikimedia.org/r/731839 [21:11:50] (03PS1) 10Hashar: contint: regularly prune docker material [puppet] - 10https://gerrit.wikimedia.org/r/731840 (https://phabricator.wikimedia.org/T292729) [21:12:47] 10SRE, 10MediaWiki-extensions-TranslationNotifications, 10serviceops: Fatal error: Uncaught Error: Class 'MediaWiki\MediaWikiServices' not found - mediawiki_job_translationnotifications - https://phabricator.wikimedia.org/T293702 (10Dzahn) [21:12:56] (03CR) 10jerkins-bot: [V: 04-1] contint: regularly prune docker material [puppet] - 10https://gerrit.wikimedia.org/r/731840 (https://phabricator.wikimedia.org/T292729) (owner: 10Hashar) [21:13:14] (03CR) 10Hashar: "Concrete usage is in the child change https://gerrit.wikimedia.org/r/c/operations/puppet/+/731840 for which I want to spread the executio" [puppet] - 10https://gerrit.wikimedia.org/r/731839 (owner: 10Hashar) [21:13:19] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
mediawiki_job_translationnotifications-mediawikiwiki.service,mediawiki_job_translationnotifications-metawiki.service daniel_zahn https://phabricator.wikimedia.org/T293702 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:08] SRE, MediaWiki-extensions-TranslationNotifications, serviceops: Fatal error: Uncaught Error: Class 'MediaWiki\MediaWikiServices' not found - mediawiki_job_translationnotifications - https://phabricator.wikimedia.org/T293702 (Dzahn) 21:09 < Reedy> https://github.com/wikimedia/mediawiki-extensions-Tran... [21:17:03] RECOVERY - SSH on gerrit2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:04] SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (Papaul) @fgiunchedi I got in touch with Eaton technical support team , they told me that the only options for SNMP are V1 and V3. I really don't like the ideal... [21:23:27] !log deployed security patch for T293556 [21:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:24] SRE, LDAP-Access-Requests, WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (Dzahn) https://wikitech.wikimedia.org/w/index.php?title=SRE%2FLDAP&type=revision&diff=1929377&oldid=1924287 https:... [21:46:03] (CR) Dzahn: "Seeing in Icinga that systemd state got degraded on dns* servers because wmf_auto_restart_systemd-timesyncd.service failed." 
[puppet] - https://gerrit.wikimedia.org/r/730852 (owner: Jbond) [21:49:06] (PS1) BBlack: p::s::timesyncd: use ensure in auto_restarts [puppet] - https://gerrit.wikimedia.org/r/731843 [21:50:30] (CR) BBlack: "This is a suggested fixup for the timesyncd auto_restart + dnsbox issue, not sure if it's the right fix, though!" [puppet] - https://gerrit.wikimedia.org/r/731843 (owner: BBlack) [21:50:58] (CR) BBlack: standard::ntp: move standard ntp to its own profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/730852 (owner: Jbond) [21:53:03] (PS1) Dzahn: timesyncd: pass ensure to profile::auto_restarts::service [puppet] - https://gerrit.wikimedia.org/r/731844 [21:54:31] (CR) jerkins-bot: [V: -1] timesyncd: pass ensure to profile::auto_restarts::service [puppet] - https://gerrit.wikimedia.org/r/731844 (owner: Dzahn) [21:55:28] (CR) Dzahn: [C: +1] p::s::timesyncd: use ensure in auto_restarts [puppet] - https://gerrit.wikimedia.org/r/731843 (owner: BBlack) [21:56:29] ACKNOWLEDGEMENT - Check systemd state on dns1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/731843 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:29] ACKNOWLEDGEMENT - Check systemd state on dns3001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/731843 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:29] ACKNOWLEDGEMENT - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/731843 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:29] ACKNOWLEDGEMENT - Check systemd state on dns5002 is 
CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/731843 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:04] (Abandoned) Dzahn: timesyncd: pass ensure to profile::auto_restarts::service [puppet] - https://gerrit.wikimedia.org/r/731844 (owner: Dzahn) [21:58:55] PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:16] ops-eqiad, Infrastructure-Foundations, netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (RobH) p:Triage→Medium [22:00:24] ops-eqiad, Infrastructure-Foundations, netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (RobH) [22:00:33] ops-eqiad, Infrastructure-Foundations, netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (RobH) [22:04:23] PROBLEM - Check systemd state on dns1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:08] ops-eqiad, Infrastructure-Foundations, netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (RobH) [22:06:46] !log deployed security patch for T293589 [22:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:05] PROBLEM - Check systemd state on dns3002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:33] SRE, LDAP-Access-Requests, WMF-NDA-Requests: When WMF staff requests to be 
added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (Dzahn) @Aklapper I mailed SRE about this and if there are no concerns coming up we can call this resolved. [22:20:24] (PS1) Razzi: kerberos: Enable kerberos for Krishna Chaitanya Velaga [puppet] - https://gerrit.wikimedia.org/r/731847 (https://phabricator.wikimedia.org/T293189) [22:21:49] (CR) Razzi: "I ran:" [puppet] - https://gerrit.wikimedia.org/r/731847 (https://phabricator.wikimedia.org/T293189) (owner: Razzi) [22:23:14] (CR) Razzi: [C: +2] "Looked over it and looks good, self-merging." [puppet] - https://gerrit.wikimedia.org/r/731847 (https://phabricator.wikimedia.org/T293189) (owner: Razzi) [22:29:59] (CR) Legoktm: [C: +1] "Seems reasonable" [puppet] - https://gerrit.wikimedia.org/r/731843 (owner: BBlack) [22:30:50] maryum: just to verify, are you done with the security window? I'd like to deploy a MW backport [22:30:57] (CR) Legoktm: [C: +2] tests: MWHttpRequestTest is a unit test, not an integration test [core] (wmf/1.38.0-wmf.4) - https://gerrit.wikimedia.org/r/731758 (owner: Legoktm) [22:30:59] (CR) Legoktm: [C: +2] Allow using a reverse proxy for local HTTP requests [core] (wmf/1.38.0-wmf.4) - https://gerrit.wikimedia.org/r/731757 (https://phabricator.wikimedia.org/T288848) (owner: Legoktm) [22:32:30] SRE-Access-Requests, Analytics, Patch-For-Review: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (Dzahn) [22:36:09] (PS1) Brennen Bearnes: gitlab: remove cas3 from external providers [puppet] - https://gerrit.wikimedia.org/r/731849 (https://phabricator.wikimedia.org/T293696) [22:40:03] Urbanecm: I will be present in the backport "UTC late backport window" [22:40:20] (CR) Jdlrobson: [C: -1] "That's not what's being asked in the ticket." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/731827 (https://phabricator.wikimedia.org/T292588) (owner: Bernard Wang) [22:40:26] there really isn't need to ping me (or a deployer) in advance :)) [22:42:07] (CR) Jdlrobson: [C: -1] Bump DesktopWebUIActionsTracking sampling rate to 100% (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/731827 (https://phabricator.wikimedia.org/T292588) (owner: Bernard Wang) [22:43:18] SRE, Wikimedia-Mailing-lists: Disable "Unblock-pt-l" - https://phabricator.wikimedia.org/T293591 (Albertoleoncio) Our need is basically to stop the flow of requests from the list and enforce the use of VRTS. The first option is good enough for this. The autoresponder can be something as simple as: ` Pre... [22:54:07] (Merged) jenkins-bot: tests: MWHttpRequestTest is a unit test, not an integration test [core] (wmf/1.38.0-wmf.4) - https://gerrit.wikimedia.org/r/731758 (owner: Legoktm) [22:54:12] (Merged) jenkins-bot: Allow using a reverse proxy for local HTTP requests [core] (wmf/1.38.0-wmf.4) - https://gerrit.wikimedia.org/r/731757 (https://phabricator.wikimedia.org/T288848) (owner: Legoktm) [22:56:34] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.4/includes/http/MWHttpRequest.php: Allow using a reverse proxy for local HTTP requests (T288848) (duration: 00m 56s) [22:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:40] T288848: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 [22:59:59] (CR) Dzahn: [C: +1] "lgtm from glancing at the docs" [puppet] - https://gerrit.wikimedia.org/r/731849 (https://phabricator.wikimedia.org/T293696) (owner: Brennen Bearnes) [23:00:04] RoanKattouw and Urbanecm: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211018T2300) [23:00:04] Juan_90264: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:41] i can deploy today [23:00:43] Juan_90264: around? [23:01:30] Still here [23:01:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:46] (PS4) Urbanecm: Create Rhymes namespace for thwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/730738 (https://phabricator.wikimedia.org/T291761) (owner: Juan90264) [23:01:49] (CR) Urbanecm: [C: +2] Create Rhymes namespace for thwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/730738 (https://phabricator.wikimedia.org/T291761) (owner: Juan90264) [23:02:36] (Merged) jenkins-bot: Create Rhymes namespace for thwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/730738 (https://phabricator.wikimedia.org/T291761) (owner: Juan90264) [23:03:00] Juan_90264: can you test? it's at mwdebug1001 [23:03:22] Yes, i can [23:04:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:41] so go ahead :) [23:07:15] (PS4) Urbanecm: Create an alias for the Draft namespace on hrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/730744 (https://phabricator.wikimedia.org/T291755) (owner: Juan90264) [23:08:44] Juan_9026463: how is it going? i see you have some internet trouble. 
[23:08:53] Sorry, errors on my internet :) [23:08:54] I tested it and now I approve [23:09:16] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730738/ [23:09:28] syncing [23:09:34] (CR) Urbanecm: [C: +2] Create an alias for the Draft namespace on hrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/730744 (https://phabricator.wikimedia.org/T291755) (owner: Juan90264) [23:10:15] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:10:17] (Merged) jenkins-bot: Create an alias for the Draft namespace on hrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/730744 (https://phabricator.wikimedia.org/T291755) (owner: Juan90264) [23:10:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: abe777d28594da852e49ccb1c1597b2598f3e483: Create Rhymes namespace for thwiktionary (T291761) (duration: 00m 57s) [23:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:01] T291761: Request for new namespaces at Thai Wiktionary - https://phabricator.wikimedia.org/T291761 [23:11:03] first change liv [23:11:05] *live [23:11:18] Juan_9026463: second change available at mwdebug1001, please test. 
[23:11:31] Yes, i can [23:12:22] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=thwiktionary --fix # T291761 [23:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:33] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:53] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:28] Urbanecm: I tested and approved [23:15:32] syncing [23:15:35] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730744/ [23:16:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b654980240d51fff3c6e9c48f7076d4609c2560f: Create an alias for the Draft namespace on hrwiki (T291755) (duration: 00m 56s) [23:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:52] T291755: hrwiki: create namespace alias Draft -> Nacrt (Draft: is mistakenly considered part of ns:0) - https://phabricator.wikimedia.org/T291755 [23:17:25] Juan_9026463: how were the images for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731231 created? 
[23:17:33] https://github.com/wikimedia/operations-mediawiki-config/blob/master/logos/config.yaml#L1645 explicitly mentions a commons image [23:17:45] or is it just using the reuploaded version? [23:21:12] Juan_9026463: can you answer the question above? [23:22:47] Urbanecm: I used the SVGs provided by the task creator that used 155x155 and resized to 135x155. Because these svgs didn't leave the default size to be followed [23:23:34] In Wiktionary I just forgot to resize [23:23:52] (PS6) Urbanecm: Repair the size of the logo of Kashmiri Wikipedia and Kashmiri Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/731231 (https://phabricator.wikimedia.org/T293373) (owner: Juan90264) [23:24:13] how did you generate the PNGs themselves (i mean, technically)? [23:24:23] On Wikipedia, the user task creator found the dialect used in the logo small so I adjusted the logo as a whole to improve the visualization [23:25:36] Urbanecm: Using Paint 3D, and an application to optimize the logo [23:26:11] so, you're not using https://gerrit.wikimedia.org/g/operations/mediawiki-config/%2B/refs/heads/master/logos/? [23:26:19] (which is linked from https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Change_the_logo_of_a_Wikimedia_wiki)? 
[23:26:41] (PS1) Urbanecm: [DNM] Run logo manager [mediawiki-config] - https://gerrit.wikimedia.org/r/731853 [23:27:14] if you open the diffs at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731231 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731853/, they're different [23:27:32] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731853/ uses the correct procedure (tox -e logos -- update kswiki) [23:27:49] Urbanecm: Yes I used in [23:28:04] I don't understand your message [23:29:07] Urbanecm: Yes I used in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730737 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730736 [23:29:29] but I'm asking about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731231 :)) [23:29:49] (PS1) DLynch: Add event stream config for discussiontools [mediawiki-config] - https://gerrit.wikimedia.org/r/731854 (https://phabricator.wikimedia.org/T286076) [23:30:00] (CR) Urbanecm: [C: -1] "this manually changes the PNG files, which are maintained by the logos scripts (https://gerrit.wikimedia.org/g/operations/mediawiki-config" [mediawiki-config] - https://gerrit.wikimedia.org/r/731231 (https://phabricator.wikimedia.org/T293373) (owner: Juan90264) [23:30:29] -1'ed, I'm not going to deploy this one today (as demonstrated in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731853/, the updater creates visually different PNGs) [23:33:05] @Urbanecm: So I should have removed the commons file citation from config.yaml? 
[23:33:31] the correct file should be uploaded to Commons :) [23:34:14] (CR) Cwhite: [C: +1] monitoring: check graphite2003 metrics [puppet] - https://gerrit.wikimedia.org/r/731434 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi) [23:34:42] (CR) Cwhite: [C: +1] statsd: failover writes to graphite2003 [puppet] - https://gerrit.wikimedia.org/r/731433 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi) [23:35:06] (CR) Cwhite: [C: +1] discovery: move read traffic to graphite2003 [dns] - https://gerrit.wikimedia.org/r/731435 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi) [23:35:43] legoktm: Okay, but what did I ask if it finds a valid alternative? [23:36:06] (CR) Cwhite: [C: +1] wmnet: move writes to graphite2003 [dns] - https://gerrit.wikimedia.org/r/731436 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi) [23:37:39] Hello? [23:37:50] Juan_9026463: I don't really understand what you're trying to say, sorry. But the logos committed in the repository, should be the same as the SVGs on Commons. That way anyone can edit the SVGs when they want to adjust the logo [23:38:29] +1 to what legoktm just said :) [23:39:51] Okay [23:39:53] But is there a standard size for adding logos, like 135x155, or can I use the best one is possible with that SVG, eg 155x155? [23:40:20] !log Updated the Wikidata property suggester with data from the 2021-10-04 JSON dump (with pre-applied T132839 workarounds) [23:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:26] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [23:40:37] *that SVG (Commons Hosted) [23:40:59] In general, you shouldn't touch the PNGs manually, the updater script should be capable of generating them correctly. 
[23:41:04] That includes sizes, too :) [23:41:24] the standard sizes are the 1.5x and 2x thumbnails that Commons generates [23:41:41] see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/logos/manage.py#107 [23:43:11] Now I understand, I thank you and possibly tomorrow I will seek to bring the change with the necessary correction [23:43:39] Juan_9026463: also see https://www.mediawiki.org/wiki/Manual:$wgLogos#Supported_variants, "The 1x version should be 135px wide by up to ~155px tall" [23:43:41] (CR) Cwhite: [C: +1] kafka_shipper: map site -> brokers centrally & point codfw to site local brokers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/731774 (https://phabricator.wikimedia.org/T293439) (owner: Herron) [23:44:13] Juan_9026463: thank you for working on logos! :) [23:44:33] :) [23:44:39] Yup :). Logos are one of the hardest usual config changes, I would say [23:49:22] Only now that my network stabilizes on IRC, I'm glad I was able to be present [23:53:16] Urbanecm: I believe that "UTC late B&C window done" was missing