[00:00:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:03:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:10:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:03] (CR) BryanDavis: [C: +1] logging: Remove host 'ip' field [mediawiki-config] - https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: Krinkle)
[01:26:12] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:55:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[02:20:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:55:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:32] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:23:40] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:55:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:36] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:06] (PS1) Marostegui: db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/734060 (https://phabricator.wikimedia.org/T290868)
[04:30:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109 (s8) for reimage T290868', diff saved to https://phabricator.wikimedia.org/P17590 and previous config saved to /var/cache/conftool/dbconfig/20211025-043028-marostegui.json
[04:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:35] T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868
[04:31:00] (CR) Marostegui: [C: +2] db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/734060 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui)
[04:37:28] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 351 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:39:32] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:40:12] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:33] (CR) Marostegui: "My suggestion to deploy this would be to merge, and recreate the view only on one host first, and if the view doesn't break any other quer" [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: Samuel (WMF))
[04:55:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:58:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1109.eqiad.wmnet with OS buster
[05:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:11] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1109.eqiad.wmnet with OS buster
[05:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:26] SRE, MW-on-K8s, serviceops, Patch-For-Review, and 2 others: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (Joe) >>! In T285232#7443847, @Joe wrote: > Sadly I found a problem with our current approach: any file under static/...
[05:43:53] <_joe_> !log pooling wtp1042 T294212
[05:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:00] T294212: wtp1026 and wtp1042 continue to be depooled - https://phabricator.wikimedia.org/T294212
[05:44:34] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:10] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:02] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=parsoid,name=wtp1026.*
[05:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:57] (PS1) Elukey: kubernetes: add revscoring-articlequality to ml-serve [puppet] - https://gerrit.wikimedia.org/r/734063 (https://phabricator.wikimedia.org/T294141)
[06:14:52] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:16:38] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:19:35] (PS1) Elukey: role::ci::master: remove old kubernetes config [labs/private] - https://gerrit.wikimedia.org/r/734065
[06:19:41] (PS1) Elukey: kubernetes: add tokens and secrets for revscoring-articlequality [labs/private] - https://gerrit.wikimedia.org/r/734066 (https://phabricator.wikimedia.org/T294141)
[06:32:28] (PS1) Elukey: Add the revscoring-articlequality ns to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/734067 (https://phabricator.wikimedia.org/T294141)
[06:49:16] (CR) Elukey: "Ben: feel free to deploy this if you have time!" [puppet] - https://gerrit.wikimedia.org/r/732573 (owner: Elukey)
[07:04:32] (CR) Elukey: [C: +2] kubernetes: add revscoring-articlequality to ml-serve [puppet] - https://gerrit.wikimedia.org/r/734063 (https://phabricator.wikimedia.org/T294141) (owner: Elukey)
[07:05:31] (CR) Kevin Bazira: "This lgtm, just a few notes;" [deployment-charts] - https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141) (owner: Accraze)
[07:10:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:02] (CR) Elukey: ml-services: add enwiki-articlequality (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141) (owner: Accraze)
[07:12:13] (CR) Elukey: [C: +2] Add the revscoring-articlequality ns to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/734067 (https://phabricator.wikimedia.org/T294141) (owner: Elukey)
[07:13:08] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:12] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:37] (PS8) Ema: upgrade-varnish: support frontend instance only [cookbooks] - https://gerrit.wikimedia.org/r/731935
[07:23:22] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:15] (CR) Ema: [C: +2] upgrade-varnish: support frontend instance only [cookbooks] - https://gerrit.wikimedia.org/r/731935 (owner: Ema)
[07:27:47] SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (fgiunchedi) >>! In T257093#7453291, @Reedy wrote: > Ok, I have now deleted everything before `20210101000000` > > ` > reedy@mwmain...
[07:29:15] (CR) Ema: [C: +2] Use ats-tls metrics for edge traffic drop alert [alerts] - https://gerrit.wikimedia.org/r/732970 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[07:36:52] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31869/console" [puppet] - https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: Dzahn)
[07:45:44] (CR) Filippo Giunchedi: [V: +1 C: +1] "LGTM, only a whitespace diff" [puppet] - https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: Dzahn)
[07:50:14] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:50:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Effeietsanders) @Dzahn thanks! Unfortunately ssh keeps asking for a password when I try to ssh into bast: `ssh -v bast1003.eqiad.wmnet` Any idea what...
[07:57:21] (CR) MMandere: [C: +2] grafana: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732959 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[08:00:19] (PS6) Volans: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: Ayounsi)
[08:00:27] (CR) Volans: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: Ayounsi)
[08:01:34] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:25] * volans looking at netbox1001
[08:04:20] volans: that's been flapping all weekend
[08:04:38] saw that
[08:06:00] RECOVERY - Disk space on ms-be2028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops
[08:06:55] (CR) Ayounsi: [C: +2] Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: Ayounsi)
[08:07:08] (CR) Vgutierrez: [C: +1] tlsproxy::localssl: acme_chief should notify nginx [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[08:07:25] (PS3) Hashar: gitlab: fix connect-src CSP [puppet] - https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363)
[08:07:33] (PS2) Hashar: gitlab: enable report only CSP on primary [puppet] - https://gerrit.wikimedia.org/r/731798 (https://phabricator.wikimedia.org/T285363)
[08:08:07] !log merge DNS changes to add drmrs
[08:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:44] (PS1) Volans: netbox: remove sync from codfw ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/734204 (https://phabricator.wikimedia.org/T286206)
[08:10:56] (CR) Volans: [C: +2] "Self-merging to prevent alert from flapping." [puppet] - https://gerrit.wikimedia.org/r/734204 (https://phabricator.wikimedia.org/T286206) (owner: Volans)
[08:13:36] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:21:08] (CR) Jelto: "lgtm +1" [puppet] - https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) (owner: Hashar)
[08:23:31] (CR) Jelto: [C: +1] "forgot the +1 ..." [puppet] - https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) (owner: Hashar)
[08:29:51] (CR) Gehel: "I think the dependencies could be simplified a bit more. Otherwise, LGTM" [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[08:30:39] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01019 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:35:30] (PS1) Volans: cumin: fix alias query [puppet] - https://gerrit.wikimedia.org/r/734205
[08:38:22] (CR) Elukey: tlsproxy::localssl: acme_chief should notify nginx (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[08:39:28] (PS2) Elukey: tlsproxy::localssl: acme_chief should notify nginx [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826)
[08:40:34] (CR) Jelto: [C: +2] gitlab: fix connect-src CSP [puppet] - https://gerrit.wikimedia.org/r/731795 (https://phabricator.wikimedia.org/T285363) (owner: Hashar)
[08:41:47] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:58] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[08:44:39] (CR) Volans: [C: +2] "Self-merge to fix alias query and prevent cron-spam from the check that all aliases are correct." [puppet] - https://gerrit.wikimedia.org/r/734205 (owner: Volans)
[08:46:11] (PS1) Volans: cumin: temporary remove ganeti-test alias [puppet] - https://gerrit.wikimedia.org/r/734206 (https://phabricator.wikimedia.org/T286206)
[08:46:55] SRE, SRE-swift-storage, ops-codfw, Data-Persistence: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (MatthewVernon)
[08:48:20] (CR) Volans: [C: +2] cumin: temporary remove ganeti-test alias [puppet] - https://gerrit.wikimedia.org/r/734206 (https://phabricator.wikimedia.org/T286206) (owner: Volans)
[08:48:40] SRE, SRE-swift-storage, ops-codfw, Data-Persistence: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (MatthewVernon) [subscribing so I get a ping once we know if there's an available spare or not]
[08:48:42] (CR) Elukey: "Adding also Matthew and Filippo since the change impacts ms-fe* nodes" [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[08:50:09] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/731798 (https://phabricator.wikimedia.org/T285363) (owner: Hashar)
[08:58:48] (CR) Jelto: [C: +2] gitlab: enable report only CSP on primary [puppet] - https://gerrit.wikimedia.org/r/731798 (https://phabricator.wikimedia.org/T285363) (owner: Hashar)
[09:00:27] (CR) Filippo Giunchedi: "LGTM (not voting though since IIRC ms-fe doesn't use acme-chief)" [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[09:01:37] (PS1) Ayounsi: Revert "Add drmrs to DNS (Netbox generated records)" [dns] - https://gerrit.wikimedia.org/r/733914
[09:01:45] (PS1) Jbond: Revert "Add drmrs to DNS (Netbox generated records)" [dns] - https://gerrit.wikimedia.org/r/733915
[09:01:57] (Abandoned) Jbond: Revert "Add drmrs to DNS (Netbox generated records)" [dns] - https://gerrit.wikimedia.org/r/733915 (owner: Jbond)
[09:02:05] (CR) Jbond: [C: +1] Revert "Add drmrs to DNS (Netbox generated records)" [dns] - https://gerrit.wikimedia.org/r/733914 (owner: Ayounsi)
[09:03:17] (CR) Jbond: [C: +2] Revert "Add drmrs to DNS (Netbox generated records)" [dns] - https://gerrit.wikimedia.org/r/733914 (owner: Ayounsi)
[09:07:55] (CR) Btullis: [C: +2] hive: bump hive server's heap settings to 10g [puppet] - https://gerrit.wikimedia.org/r/732573 (owner: Elukey)
[09:14:27] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:14:46] (PS1) Volans: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787)
[09:17:06] (CR) Lucas Werkmeister (WMDE): [C: +1] Regularly resubmit changes that might be stuck in wb_changes [puppet] - https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: Michael Große)
[09:18:41] !log bounce graphite-web on graphite2003 to test timeout bump - T294220
[09:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:48] T294220: Graphite query timeout from ExtensionDistributor - https://phabricator.wikimedia.org/T294220
[09:19:09] (CR) Jbond: [C: +1] "LGTM, optional comment" [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787) (owner: Volans)
[09:20:23] (PS1) Vgutierrez: cache: Provide a HAproxy upload role [puppet] - https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005)
[09:22:37] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:35:25] (CR) Klausman: [C: +1] ml-services: add enwiki-articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141) (owner: Accraze)
[09:35:44] (CR) Klausman: [C: +1] kubernetes: add tokens and secrets for revscoring-articlequality [labs/private] - https://gerrit.wikimedia.org/r/734066 (https://phabricator.wikimedia.org/T294141) (owner: Elukey)
[09:40:53] (PS2) Urbanecm: Add pwn to langlist helper [dns] - https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) (owner: Gerrit maintenance bot)
[09:41:03] (PS2) Urbanecm: Add ami to langlist helper [dns] - https://gerrit.wikimedia.org/r/733141 (https://phabricator.wikimedia.org/T292414) (owner: Gerrit maintenance bot)
[09:42:32] if someone could +2 the two patches above, I'd appreciate it :)
[09:43:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:733089|[BETA CLUSTER] Enable WikibaseLexeme Scribunto access (T294159)]] (merged on Friday, syncing now to avoid outdated files even if it’s just -labs.php) (duration: 00m 55s)
[09:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:46] T294159: Enable Lexeme access on first set of projects - https://phabricator.wikimedia.org/T294159
[09:46:46] (CR) Volans: [C: +2] "LGTM, thanks for the patch!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/732919 (https://phabricator.wikimedia.org/T294082) (owner: Majavah)
[09:47:47] (Merged) jenkins-bot: Use most specific prefix for dns record site assignment [software/netbox-extras] - https://gerrit.wikimedia.org/r/732919 (https://phabricator.wikimedia.org/T294082) (owner: Majavah)
[09:48:04] (PS1) Jbond: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214
[09:48:58] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:48] (CR) jerkins-bot: [V: -1] cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[09:52:11] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:19] !log bounce uwsgi graphite web on graphite2003 - T294220
[09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:25] T294220: Graphite query timeout from ExtensionDistributor - https://phabricator.wikimedia.org/T294220
[09:54:12] (PS2) Jbond: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214
[10:05:39] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:46] (CR) Jbond: [C: +1] "LGTM, optional nit" [puppet] - https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: Dzahn)
[10:22:41] (PS1) Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005)
[10:24:33] (PS1) Filippo Giunchedi: graphite: bump fetch_timeout [puppet] - https://gerrit.wikimedia.org/r/734224 (https://phabricator.wikimedia.org/T247963)
[10:24:35] (PS1) Filippo Giunchedi: graphite: set CLUSTER_SERVERS empty with no remote servers [puppet] - https://gerrit.wikimedia.org/r/734225 (https://phabricator.wikimedia.org/T247963)
[10:29:08] (PS2) Filippo Giunchedi: graphite: set CLUSTER_SERVERS empty with no remote servers [puppet] - https://gerrit.wikimedia.org/r/734225 (https://phabricator.wikimedia.org/T247963)
[10:32:44] (CR) Volans: "Nice addition, few nits inline, nothing as a blocker." [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[10:44:42] (PS1) Majavah: dynamicproxy: remove duplicate listen 80 [puppet] - https://gerrit.wikimedia.org/r/734227
[10:56:55] SRE-Access-Requests: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (EChetty)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T1100).
[11:00:04] Lucas_WMDE: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:06] (PS3) Jbond: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214
[11:00:09] o/
[11:00:11] o/
[11:00:12] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[11:00:15] Lucas_WMDE: assuming you'll self-service?
[11:00:18] yup :)
[11:00:26] shout if I'm needed then :)
[11:00:28] * urbanecm hides
[11:00:31] but all my changes are no-op cleanups so if anyone else wants to deploy, just let me know
[11:00:33] ok!
[11:02:23] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE))
[11:02:34] (CR) jerkins-bot: [V: -1] cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[11:03:21] (Merged) jenkins-bot: Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE))
[11:05:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:732372|Remove dispatchViaJobs-related Wikibase settings (T291828)]] (duration: 00m 56s)
[11:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:05] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828
[11:06:38] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:06:46] SRE-Access-Requests: Requesting Kerberos identity for echetty - https://phabricator.wikimedia.org/T294231 (EChetty)
[11:07:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:18] (Merged) jenkins-bot: Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:09:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:732949|Remove dispatchChanges.php-related Wikibase settings (T292604)]] (duration: 00m 55s)
[11:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:20] T292604: Clean up old change dispatching code - https://phabricator.wikimedia.org/T292604
[11:09:38] (PS1) Jbond: P:puppetmaster::ng: Enable catalog and update inventory facts [puppet] - https://gerrit.wikimedia.org/r/734232 (https://phabricator.wikimedia.org/T264276)
[11:10:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:19] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:11:56] (PS4) Jbond: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214
[11:12:40] (Merged) jenkins-bot: Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:13:10] (PS2) Jbond: P:puppetmaster::ng: Enable catalog and update inventory facts [puppet] - https://gerrit.wikimedia.org/r/734232 (https://phabricator.wikimedia.org/T264276)
[11:14:29] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:732950|Remove wmg variables for dispatchChanges.php Wikibase settings (T292604)]] (duration: 00m 55s)
[11:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:35] T292604: Clean up old change dispatching code - https://phabricator.wikimedia.org/T292604
[11:14:59] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:16:04] (PS3) Jbond: P:puppetmaster::ng: Enable catalog and update inventory facts [puppet] - https://gerrit.wikimedia.org/r/734232 (https://phabricator.wikimedia.org/T264276)
[11:16:08] (Merged) jenkins-bot: Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:18:00] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:732951|Remove wikibaseDispatchRedisLockManager config (T292604)]] (duration: 00m 54s)
[11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:14] SRE, Infrastructure-Foundations, netops, observability, Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (Volans) AFAIK we're still alerting just sending emails to VO inste...
[11:18:53] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:19:02] (PS1) MMandere: cumin: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/734245 (https://phabricator.wikimedia.org/T282787)
[11:19:04] (CR) jerkins-bot: [V: -1] Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:19:40] (PS2) Lucas Werkmeister (WMDE): Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604)
[11:20:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:06] (CR) Lucas Werkmeister (WMDE): [C: +2] "." [mediawiki-config] - https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:20:52] (Merged) jenkins-bot: Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:22:48] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:732969|Remove dispatchLagToMaxLagFactor Wikibase setting (T292604)]] (duration: 00m 54s)
[11:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:55] T292604: Clean up old change dispatching code - https://phabricator.wikimedia.org/T292604
[11:24:32] !log UTC morning backport+config window done
[11:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:19] SRE-Access-Requests: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (RhinosF1) Needs @Ottomata or @odimitrijevic's approval.
[11:29:48] SRE-Access-Requests: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (RhinosF1) Note: wikidev isn't manually done and researchers is no longer used.
[11:29:59] (PS1) Ema: Add 0009-vsl-perf-stability.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/734249 (https://phabricator.wikimedia.org/T293879)
[11:31:05] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31875/console" [puppet] - https://gerrit.wikimedia.org/r/734232 (https://phabricator.wikimedia.org/T264276) (owner: Jbond)
[11:33:15] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:36:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:09] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:37:25] SRE-Access-Requests, Data-Engineering: Requesting Kerberos identity for echetty - https://phabricator.wikimedia.org/T294231 (Aklapper)
[11:38:38] (CR) Jbond: [V: +1 C: +2] P:puppetmaster::ng: Enable catalog and update inventory facts [puppet] - https://gerrit.wikimedia.org/r/734232 (https://phabricator.wikimedia.org/T264276) (owner: Jbond)
[11:40:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:21] SRE-Access-Requests, Data-Engineering: Requesting Kerberos identity for echetty - https://phabricator.wikimedia.org/T294231 (RhinosF1) Duplicate of T294229
[11:40:33] SRE-Access-Requests, Data-Engineering: Requesting Kerberos identity for echetty - https://phabricator.wikimedia.org/T294231 (RhinosF1)
[11:40:35] SRE-Access-Requests, Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (RhinosF1)
[11:40:58] SRE-Access-Requests, Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (RhinosF1)
[11:42:56] (PS3) RhinosF1: Add echetty to product-users and ssh access [puppet] - https://gerrit.wikimedia.org/r/733916 (https://phabricator.wikimedia.org/T294229)
[11:45:40] (PS3) Filippo Giunchedi: graphite: set CLUSTER_SERVERS empty with no remote servers [puppet] - https://gerrit.wikimedia.org/r/734225 (https://phabricator.wikimedia.org/T247963)
[11:48:55] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:50:51] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:51:56] (CR) Filippo Giunchedi: "I realized CLUSTER_SERVERS included graphite2003 (itself) on graphite2003, resulting in "remote" queries being issued" [puppet] - https://gerrit.wikimedia.org/r/734225 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[11:52:00] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31876/console" [puppet] - https://gerrit.wikimedia.org/r/734225 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[11:56:25] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:09] !log deployment-cache-text06: upgrade varnish to 6.0.8-1wm2 T293879
[11:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:16] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[11:57:39] (PS1) Hashar: (DO NOT MERGE) compile deploy-1002 [puppet] - https://gerrit.wikimedia.org/r/734254
[11:58:11] (PS2) Hashar: (DO NOT MERGE) compile deploy-1002 [puppet] - https://gerrit.wikimedia.org/r/734254
[11:58:27] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/734254 (owner: Hashar)
[11:59:31] (CR) jerkins-bot: [V: -1] Add 0009-vsl-perf-stability.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/734249 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[12:00:39] (PS3) Hashar: (DO NOT MERGE) compile deploy-1002 [puppet] - https://gerrit.wikimedia.org/r/734254
[12:01:03] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/734254 (owner: Hashar)
[12:01:33] (CR) Ayounsi: Add drmrs to DNS (Netbox generated records) (2 comments) [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787) (owner: Volans)
[12:02:33] SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (Reedy) That's quite a reduction! :)
[12:04:09] !log cp3062: upgrade varnish to 6.0.8-1wm2 T293879
[12:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:16] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[12:04:25] (Abandoned) Hashar: (DO NOT MERGE) compile deploy-1002 [puppet] - https://gerrit.wikimedia.org/r/734254 (owner: Hashar)
[12:06:01] (PS1) Jbond: P:puppetboard::ng: add notify for service [puppet] - https://gerrit.wikimedia.org/r/734255
[12:18:24] (PS2) Jbond: P:puppetboard::ng: add notify for service [puppet] - https://gerrit.wikimedia.org/r/734255 (https://phabricator.wikimedia.org/T264276)
[12:19:06] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31878/console" [puppet] - https://gerrit.wikimedia.org/r/734255 (https://phabricator.wikimedia.org/T264276) (owner: Jbond)
[12:19:32] (CR) Jbond: [V: +1 C: +2] P:puppetboard::ng: add notify for service [puppet] - https://gerrit.wikimedia.org/r/734255 (https://phabricator.wikimedia.org/T264276) (owner: Jbond)
[12:23:29] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.21% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:29:52] (CR) Volans: [C: -1] "LGTM but this has to wait that each alias matches at least 1 host before merging it (or split into 3 different CRs if they will be online " [puppet] - https://gerrit.wikimedia.org/r/734245 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[12:31:32] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[12:34:06] How can I trigger build of image in https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-cxserver/tags/ (Patch in question is: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/727269 where it failed first, but recheck was successful later)
[12:36:56] (PS1) Jbond: P:puppetboard::ng: drop icinga check [puppet] - https://gerrit.wikimedia.org/r/734257 (https://phabricator.wikimedia.org/T264276)
[12:38:15] kart_: https://integration.wikimedia.org/ci/job/trigger-service-pipeline-test-and-publish/2188/
[12:38:44] (CR) Jbond: [C: +2] P:puppetboard::ng: drop icinga check [puppet] - https://gerrit.wikimedia.org/r/734257 (https://phabricator.wikimedia.org/T264276) (owner: Jbond)
[12:39:07] (PS5) Jbond: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214
[12:39:37] (CR) Jbond: [C: +2] cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[12:41:38] (CR) MMandere: cumin: Add drmrs DC site (1 comment) [puppet] - https://gerrit.wikimedia.org/r/734245 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[12:42:21] (Merged) jenkins-bot: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] - https://gerrit.wikimedia.org/r/734214 (owner: Jbond)
[12:42:29] (CR) Ema: [C: +2] varnishttfb.mtail: use native histogram type [puppet] - https://gerrit.wikimedia.org/r/732925 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[12:44:14] kart_: and it passed
[12:44:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,gitlab,nginx,redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:44:57] ^ thats me, expected
[12:45:12] What was the magic, @Reedy? :)
[12:45:42] kart_: With the right jenkins rights... Just go to https://integration.wikimedia.org/ci/job/trigger-service-pipeline-test-and-publish/2187/ login, and hit "rebuild"
[12:45:59] OK. Thanks! :)
[12:46:19] (ie very easy/trivial if you have the correct access)
[12:46:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:46:30] Got it. I have access.
[12:53:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:06:46] (PS1) Btullis: Move analytics-hive to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/734260
[13:08:45] SRE, Observability-Logging, Traffic, Patch-For-Review, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) >>! In T293879#7454657, @gerritbot wrote: > Change 732925 **merged** by Ema: > %%%[operations/puppet@produ...
[13:10:17] (PS1) Jbond: puppetboard: add puppetboard as an active/active service [dns] - https://gerrit.wikimedia.org/r/734262
[13:10:23] (PS1) Jbond: puppetboard: add puppetboard as an active/active service [puppet] - https://gerrit.wikimedia.org/r/734263
[13:12:44] (PS1) Jbond: apt: update apt service to critical service [puppet] - https://gerrit.wikimedia.org/r/734264
[13:14:01] (CR) Jbond: [C: +2] apt: update apt service to critical service [puppet] - https://gerrit.wikimedia.org/r/734264 (owner: Jbond)
[13:22:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:23:38] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (Ottomata) Approved. Emil will need a Kerberos ticket as well.
[13:27:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:32:34] SRE, MW-on-K8s, SRE Observability, serviceops, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe) I have a (supposedly) working set of normalizing rules to interpret and structure both the php-fpm error log (mostly uninteresting) and the...
[13:33:35] (CR) Elukey: [C: +1] Move analytics-hive to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/734260 (owner: Btullis)
[13:38:05] SRE, ops-eqiad, Sustainability (Incident Followup): eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (ayounsi) Giving more details on our current process and what happened here. When configuring a new circuit, we enable the interface on our side ahead of time, so DCops can c...
[13:45:44] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/733916 (https://phabricator.wikimedia.org/T294229) (owner: RhinosF1)
[13:46:31] jbond: ty
[13:47:15] i think only manager approval is needed now
[13:47:25] Puppet, Infrastructure-Foundations, GitLab (Infrastructure), Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (Jelto) I identified at least two issues which prevent us from having a successful restore: One is puppet agent runs automati...
[13:48:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (RhinosF1) @DAbad: As with last ticket, this should need your approval.
[13:51:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:01:01] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) @ayounsi I checked the status of both PEM today. looks good to me. Do you want to to close the task ` PEM 0 status: State Online Airflow...
[14:04:32] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (ayounsi) Looks like it's still alerting: ` cr2-eqdfw> show system alarms 1 alarms currently active Alarm time Class Description 2021-10-21 22:36:50 UTC Major PEM 1 Input Vol...
[14:08:08] SRE, ops-codfw, DC-Ops: codfw Related Netbox Errors - https://phabricator.wikimedia.org/T294158 (Papaul) Open→Resolved Complete
[14:13:11] SRE, SRE-swift-storage, ops-codfw, Data-Persistence: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (Papaul) Open→Resolved @MatthewVernon disk replaced
[14:15:14] (PS2) Jbond: puppetboard: add puppetboard as an active/active service [puppet] - https://gerrit.wikimedia.org/r/734263
[14:15:21] (PS3) Jbond: puppetboard: add puppetboard as an active/active service [puppet] - https://gerrit.wikimedia.org/r/734263
[14:20:16] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:21:29] (CR) Alexandros Kosiaris: [C: -1] Rename main cluster to services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/725003 (owner: Alexandros Kosiaris)
[14:22:36] Puppet, Infrastructure-Foundations, GitLab (Infrastructure), Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (Dzahn) >>! In T283076#7454868, @Jelto wrote: > So we have to make sure GitLab is not started by puppet agent runs during the...
[14:23:43] (PS1) Filippo Giunchedi: Revert "discovery: move read traffic to graphite2003" [dns] - https://gerrit.wikimedia.org/r/734277 (https://phabricator.wikimedia.org/T247963)
[14:25:02] (PS1) Filippo Giunchedi: Revert "statsd: failover writes to graphite2003" [puppet] - https://gerrit.wikimedia.org/r/734278 (https://phabricator.wikimedia.org/T247963)
[14:25:04] (PS1) Filippo Giunchedi: Revert "monitoring: check graphite2003 metrics" [puppet] - https://gerrit.wikimedia.org/r/734279 (https://phabricator.wikimedia.org/T247963)
[14:25:06] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) okay
[14:25:30] (PS1) Filippo Giunchedi: Revert "wmnet: move writes to graphite2003" [dns] - https://gerrit.wikimedia.org/r/734280 (https://phabricator.wikimedia.org/T247963)
[14:26:09] (PS1) Filippo Giunchedi: Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - https://gerrit.wikimedia.org/r/734281 (https://phabricator.wikimedia.org/T247963)
[14:27:19] (CR) Filippo Giunchedi: [C: +1] profile: add beta logstash profile [puppet] - https://gerrit.wikimedia.org/r/727627 (https://phabricator.wikimedia.org/T288618) (owner: Cwhite)
[14:29:16] RECOVERY - Device not healthy -SMART- on ms-be2028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops
[14:30:47] (PS1) Jbond: cas templates: update templatre locations [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/734283
[14:31:11] !log Deploy schema change on s3 codfw - T291719
[14:31:16] (CR) Jbond: [V: +2 C: +2] cas templates: update templatre locations [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/734283 (owner: Jbond)
[14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:20] T291719: Remove abuse_filter_log.afl_filter column and adjust schema consequently from Wikimedia production - https://phabricator.wikimedia.org/T291719
[14:35:39] (CR) Alexandros Kosiaris: [C: -1] "LGTM. 1 nit and 1 answer inline." [puppet] - https://gerrit.wikimedia.org/r/732844 (owner: Ahmon Dancy)
[14:37:42] SRE, SRE-swift-storage, ops-codfw, Data-Persistence: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (MatthewVernon) @Papaul thanks :)
[14:44:24] (PS6) Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - https://gerrit.wikimedia.org/r/733092
[14:44:54] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul) @Dzahn I need mw2253 and contint2001 down for me to reset the IDRAC befor...
[14:45:26] !log update cas package
[14:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:52] (CR) jerkins-bot: [V: -1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - https://gerrit.wikimedia.org/r/733092 (owner: Dzahn)
[14:46:55] SRE, Infrastructure-Foundations, netops, observability, Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (herron) To clarify, the alert did make it to VO after a delay http...
[14:47:07] SRE, SRE-swift-storage, ops-codfw, Data-Persistence: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (Dzahn) 14:29 <+icinga-wm> RECOVERY - Device not healthy -SMART- on ms-be2028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts...
[14:48:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2253.codfw.wmnet
[14:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:06] !log depooling mw2253 for DRAC upgrade (T283582)
[14:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:13] T283582: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582
[14:50:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2253.codfw.wmnet with reason: DRAC upgrade
[14:50:05] (PS1) Jbond: logout: move logout file to correct location [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/734287
[14:50:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2253.codfw.wmnet with reason: DRAC upgrade
[14:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:25] (CR) Jbond: [V: +2 C: +2] logout: move logout file to correct location [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/734287 (owner: Jbond)
[14:52:32] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) @Papaul mw2253 is not a problem. done. it's shut down and downtimed. cont...
[14:53:07] (PS1) KartikMistry: Update cxserver to 2021-10-25-123807-production [deployment-charts] - https://gerrit.wikimedia.org/r/734288 (https://phabricator.wikimedia.org/T217747)
[14:54:29] SRE, ops-eqiad, Infrastructure-Foundations, netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (ayounsi) p:Medium→High Could you try to roll that fiber? Telxius is not receiving light from our side. Nor the other way around.
[14:56:23] !log mw2253 - shut down and downtimed for 2 days
[14:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:59] (PS1) Jbond: idp.wikimedia.org: update record tyo make idp2001 the live instance [dns] - https://gerrit.wikimedia.org/r/734290
[14:58:31] would someone be kind enough to update the topic to reflect that I am on clinic duty this week? thank you
[15:01:42] Reedy: much appreciated <3
[15:07:54] (PS1) Majavah: toolforge: Update to ingress-nginx v1.0 [puppet] - https://gerrit.wikimedia.org/r/734294 (https://phabricator.wikimedia.org/T292771)
[15:09:07] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (Cmjohnson) we have a new install script and it's not working for this server. This is the error I am getting cmjohnson@cumin1001:~$ sudo cookbook sre.hosts.reima...
[15:09:29] SRE, Traffic, observability, Discovery-Search (Current work), Patch-For-Review: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (MPhamWMF)
[15:14:33] SRE, ops-eqiad, Infrastructure-Foundations, netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (RobH)
[15:15:30] (PS2) Ahmon Dancy: docker: Mostly documentation updates [puppet] - https://gerrit.wikimedia.org/r/732844
[15:16:32] (PS3) Ahmon Dancy: docker: Mostly documentation updates [puppet] - https://gerrit.wikimedia.org/r/732844
[15:17:02] (CR) Ahmon Dancy: docker: Mostly documentation updates (3 comments) [puppet] - https://gerrit.wikimedia.org/r/732844 (owner: Ahmon Dancy)
[15:19:21] (PS2) Volans: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787)
[15:20:04] (PS1) BBlack: wmnet: make edge DC layout more-obvious [dns] - https://gerrit.wikimedia.org/r/734296 (https://phabricator.wikimedia.org/T282787)
[15:20:06] (CR) Volans: "addressed comments" [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787) (owner: Volans)
[15:20:08] (PS1) BBlack: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734297 (https://phabricator.wikimedia.org/T282787)
[15:20:34] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:21:10] (CR) Ayounsi: [C: +1] Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787) (owner: Volans)
[15:21:42] heh
[15:22:09] sorry, I didn't realize you were working on this too, we were talking about it elsewhere!
[15:22:15] (CR) Btullis: [C: +2] Move analytics-hive to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/734260 (owner: Btullis)
[15:24:45] (PS1) Lucas Werkmeister (WMDE): Empty wikibase disabled access entity types on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/734298 (https://phabricator.wikimedia.org/T294159)
[15:24:51] (CR) BBlack: [C: +1] "I'll rework the pre-patch I had elsewhere to go after this, please merge! :)" [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787) (owner: Volans)
[15:25:36] sorry for the trouble bblack, this was mostly Arzhel's work, I just did the last bit of the patch because he's on a train :)
[15:25:45] (PS3) Volans: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787)
[15:26:15] (Abandoned) BBlack: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734297 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[15:26:47] (CR) Lucas Werkmeister (WMDE): "> https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/8732/console : SUCCESS Please carefully r" [mediawiki-config] - https://gerrit.wikimedia.org/r/734298 (https://phabricator.wikimedia.org/T294159) (owner: Lucas Werkmeister (WMDE))
[15:27:15] (CR) Volans: [C: +2] Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/734208 (https://phabricator.wikimedia.org/T282787) (owner: Volans)
[15:28:10] (PS2) BBlack: wmnet: make edge DC layout more-obvious [dns] - https://gerrit.wikimedia.org/r/734296 (https://phabricator.wikimedia.org/T282787)
[15:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T1530).
[15:30:22] (CR) Volans: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/734296 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[15:32:18] (CR) BBlack: [C: +2] wmnet: make edge DC layout more-obvious [dns] - https://gerrit.wikimedia.org/r/734296 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[15:34:57] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Ottomata)
[15:43:08] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/734328 (https://phabricator.wikimedia.org/T128546)
[15:44:27] (CR) Alexandros Kosiaris: [C: +2] docker: Mostly documentation updates [puppet] - https://gerrit.wikimedia.org/r/732844 (owner: Ahmon Dancy)
[15:44:35] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/734328 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:44:49] (PS2) Jbond: idp.wikimedia.org: update record tyo make idp2001 the live instance [dns] - https://gerrit.wikimedia.org/r/734290
[15:44:52] (CR) Jbond: [C: +2] idp.wikimedia.org: update record tyo make idp2001 the live instance [dns] - https://gerrit.wikimedia.org/r/734290 (owner: Jbond)
[15:46:38] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/734328 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:46:49] !log upgrade cas/idp to 6.4.2
[15:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:28] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:734328| Bumping portals to master (T128546)]] (duration: 01m 54s)
[15:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:35] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:49:42] SRE, Traffic, observability, Discovery-Search (Current work), Patch-For-Review: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (MPhamWMF)
[15:52:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:40] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:734328| Bumping portals to master (T128546)]] (duration: 01m 52s)
[15:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:47] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:01:05] !log mmandere@cumin2002 START - Cookbook sre.dns.netbox
[16:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:20] Heyo, just fyi, I ran the portals deploy and got one apache host error:
[16:01:21] 15:57:40 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 (ran as mwdeploy@mw2253.codfw.wmnet) returned [255]: ssh: connect to host mw2253.codfw.wmnet port 22: Connection timed out
[16:02:30] jan_drewniak: looks like mw2253 is down for some maintenance, so that's expected
[16:03:08] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:03:10] https://phabricator.wikimedia.org/T283582#7455143
[16:03:22] shouldn't it be set as pooled=inactive then?
[16:04:11] (CR) Jforrester: "Ah, right. Sure; shall we deploy?" [mediawiki-config] - https://gerrit.wikimedia.org/r/734298 (https://phabricator.wikimedia.org/T294159) (owner: Lucas Werkmeister (WMDE))
[16:04:19] jouncebot: nowandnext
[16:04:20] No deployments scheduled for the next 0 hour(s) and 55 minute(s)
[16:04:20] In 0 hour(s) and 55 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T1700)
[16:04:42] (CR) Lucas Werkmeister (WMDE): [C: +2] Empty wikibase disabled access entity types on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/734298 (https://phabricator.wikimedia.org/T294159) (owner: Lucas Werkmeister (WMDE))
[16:04:48] !log mmandere@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:04] ^ I’ll sync that config change once it merges (though it’s beta-only anyways)
[16:05:10] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:05:54] (Merged) jenkins-bot: Empty wikibase disabled access entity types on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/734298 (https://phabricator.wikimedia.org/T294159) (owner: Lucas Werkmeister (WMDE))
[16:07:00] (PS1) Jbond: cas: update config property names [puppet] - https://gerrit.wikimedia.org/r/734338
[16:08:11] (PS2) Accraze: ml-services: add enwiki-articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141)
[16:08:15] (CR) Jbond: [C: +2] cas: update config property names [puppet] - https://gerrit.wikimedia.org/r/734338 (owner: Jbond)
[16:08:30] Lucas_WMDE: +1
[16:08:51] hmm, 1 apaches had sync errors
[16:08:52] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:734298|Empty wikibase disabled access entity types on Beta (T294159)]] (beta-only) (duration: 01m 47s)
[16:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:00] T294159: Enable Lexeme access on first set of projects - https://phabricator.wikimedia.org/T294159
[16:09:02] mw2253 had connection timeout
[16:09:04] Lucas_WMDE: You made manual changes on the box?
[16:09:09] on the scap pull and also the php-fpm restart
[16:09:09] Or have you reverted those?
[16:09:21] James_F: on the beta cluster, not in production
[16:09:23] (PS1) AOkoth: gitlab: disable puppet and rename files [puppet] - https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076)
[16:09:30] I didn’t touch mw2253, that must be something else
[16:09:30] Oh, yeah, never mind.
[16:09:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:44] ah, mutante shut it down for 2 days ^^ [16:09:46] mw2253 is shut down and depooled, sorry , my bad [16:09:49] https://sal.toolforge.org/log/cZvzt3wB8Fs0LHO51aT1 [16:09:50] fixing [16:09:51] (03CR) 10Accraze: "Thanks for the reviews! I have fixed the helmfile name, so things should be good to go 😎" [deployment-charts] - 10https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141) (owner: 10Accraze) [16:09:52] Yeah. [16:09:54] ok thanks :) [16:10:01] was about to ask if scap should still be trying to SSH to it then [16:10:10] no, it should not [16:10:10] but apart from that the sync went fine apparently [16:10:17] It shouldn't be in the dsh group any more. [16:10:18] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2253.codfw.wmnet [16:10:21] ^ [16:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:25] thx! [16:10:34] Awesome. [16:10:43] I set it to =no but not =inactive [16:10:47] and the difference is that [16:12:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:12] (03CR) 10AOkoth: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31879/" [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [16:14:26] (03CR) 10Elukey: [C: 03+2] ml-services: add enwiki-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141) (owner: 10Accraze) [16:16:48] PROBLEM - Host mw2253.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:09] ^ ok, now this is WHY i took it down, maintenance on mgmt [16:17:29] well, there is yet another level to it [16:17:39] the maintenance would be to stop the flapping alerts like this one :) [16:17:40] (03PS2) 10AOkoth: gitlab: disable puppet and rename files [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) [16:18:01] DRAC firmware upgrades fix PING timeouts [16:18:33] (03PS1) 10Btullis: Revert the active hive server to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/734342 [16:20:37] (03PS1) 10Jbond: cas.properties:rename u2f.expire-devices key [puppet] - 10https://gerrit.wikimedia.org/r/734343 [16:21:37] !log mmandere@cumin2002 START - Cookbook sre.dns.netbox [16:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:54] RECOVERY - Host mw2253.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [16:23:30] (03CR) 10Jbond: [C: 03+2] cas.properties:rename u2f.expire-devices key [puppet] - 10https://gerrit.wikimedia.org/r/734343 (owner: 10Jbond) [16:23:48] (03CR) 10Dzahn: "Yea, so what happened at https://phabricator.wikimedia.org/T151642#7452946 is exactly what I meant with my warning above to check if it wo" [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [16:23:57] /win 7 [16:24:04] ufff [16:24:24] 7 is a low number, 
must be one of the main channels [16:24:36] wikimedia-ml :) [16:24:41] :) [16:25:08] !log mmandere@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:29] (03PS1) 10Jbond: cas: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/734345 [16:25:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] cas: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/734345 (owner: 10Jbond) [16:25:40] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:45] (03PS1) 10Ayounsi: Add eqsin-ulsfo transport v6 [dns] - 10https://gerrit.wikimedia.org/r/734346 (https://phabricator.wikimedia.org/T273308) [16:28:09] !log mmandere@cumin2002 START - Cookbook sre.dns.netbox [16:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:53] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/734346 (https://phabricator.wikimedia.org/T273308) (owner: 10Ayounsi) [16:31:45] !log mmandere@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10EChetty) >>! In T294229#7454787, @Ottomata wrote: > Approved. Emil will need a Kerberos ticket as well. I had this ticket open but it was ma... [16:32:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10Ottomata) Naw, not wrong, its just easier to track access requests for one person in a single ticket. 
[16:35:02] (03CR) 10Hashar: zuul: use releng list rather than jenkins-bot for email (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [16:36:30] (03CR) 10Ayounsi: [C: 03+2] Add eqsin-ulsfo transport v6 [dns] - 10https://gerrit.wikimedia.org/r/734346 (https://phabricator.wikimedia.org/T273308) (owner: 10Ayounsi) [16:36:56] !log DNS - Add eqsin-ulsfo transport v6 prefix - T273308 [16:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:54] (03CR) 10Dzahn: zuul: use releng list rather than jenkins-bot for email (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [16:49:54] !log update management routers ACLs [16:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:26] (03PS1) 10Legoktm: Update footer links [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734350 (https://phabricator.wikimedia.org/T199812) [16:54:09] (03PS1) 10Legoktm: Link to Libera Chat's official webchat [software/klaxon] - 10https://gerrit.wikimedia.org/r/734351 [17:00:05] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T1700). 
[17:01:07] 10SRE, 10Release-Engineering-Team, 10serviceops: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (10Dzahn) [17:01:30] 10SRE, 10Release-Engineering-Team, 10serviceops: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (10Dzahn) [17:01:40] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) [17:02:41] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH) [17:02:49] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @Papaul Let's go ahead with mw2253. For contint2001 please consider it sta... [17:03:50] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH) a:05Cmjohnson→03Jclark-ctr [17:04:00] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (10hashar) contint2001.wikimedia.org is indeed the primary for CI (Jenkins and Zuul). We could switch over to the other host but the runboo... 
[17:07:03] (03PS1) 10BryanDavis: toolhub: Bump container version to 2021-10-25-160227-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/734355 (https://phabricator.wikimedia.org/T294072) [17:15:37] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10Jclark-ctr) [17:16:12] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10Jclark-ctr) using light meter no light. from ports 31/32 on patch panel [17:16:14] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-10-25-160227-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/734355 (https://phabricator.wikimedia.org/T294072) (owner: 10BryanDavis) [17:16:42] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10Jclark-ctr) a:05Jclark-ctr→03RobH [17:20:10] !log mmandere@cumin2002 START - Cookbook sre.dns.netbox [17:20:12] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-10-25-160227-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/734355 (https://phabricator.wikimedia.org/T294072) (owner: 10BryanDavis) [17:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:34] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:22:37] !log update core routers ACLs [17:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:14] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[17:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:08] !log mmandere@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:34] (03CR) 10CDanis: [C: 03+2] "Awesome, thank you!" [software/klaxon] - 10https://gerrit.wikimedia.org/r/734351 (owner: 10Legoktm) [17:25:33] (03Merged) 10jenkins-bot: Link to Libera Chat's official webchat [software/klaxon] - 10https://gerrit.wikimedia.org/r/734351 (owner: 10Legoktm) [17:25:37] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (10Dzahn) < mutante> then let's just tell @Papaul what time is ok, basically < mutante> or a time where all can be around with him in DC <+... [17:26:22] (03PS1) 10AntiCompositeNumber: explicitly declare 2 message dependencies in wikibase.mediainfo.statements [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) [17:26:32] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH) [17:26:57] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[17:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:41] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Papaul) [17:27:55] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Papaul) [17:28:35] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Papaul) @Dzahn mw2253 done [17:29:06] (03PS1) 10Odder: Add mobile wordmark for Meetei (Manipuri) Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734359 (https://phabricator.wikimedia.org/T294189) [17:30:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Naray-ctr - https://phabricator.wikimedia.org/T293810 (10Ottomata) [17:32:22] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[17:32:26] 10SRE, 10Wikimedia-Mailing-lists: Delete "releng" mailman account - https://phabricator.wikimedia.org/T294270 (10Legoktm) p:05Triage→03Lowest [17:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:56] (03PS1) 10Odder: Add mobile wordmark for Meetei (Manipuri) Wikipedia to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 (https://phabricator.wikimedia.org/T294189) [17:39:51] !log mw2253 - scap pull after hw maintenance is over [17:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:01] (03CR) 10jerkins-bot: [V: 04-1] Add mobile wordmark for Meetei (Manipuri) Wikipedia to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [17:40:08] jouncebot: now [17:40:08] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [17:40:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2253.codfw.wmnet [17:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:25] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2253.codfw.wmnet [17:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:41] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @Papaul. Thank you! - scap pulled - confirmed icinga green - repooled to... 
[17:45:10] (03PS3) 10Zabe: Fix array declaration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058) [17:45:12] (03PS2) 10Zabe: Fix some easy codestyle issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732971 [17:47:39] (03PS2) 10Odder: Add mobile wordmark for Meetei (Manipuri) Wikipedia to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 (https://phabricator.wikimedia.org/T294189) [17:48:30] (03CR) 10jerkins-bot: [V: 04-1] Add mobile wordmark for Meetei (Manipuri) Wikipedia to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [17:48:32] that mw2253 is back in production (and should not alert anymore on mgmt) [17:48:54] (03CR) 10Legoktm: [C: 03+2] package_builder: Add hook to stop rebuilding man-db [puppet] - 10https://gerrit.wikimedia.org/r/732383 (https://phabricator.wikimedia.org/T276632) (owner: 10Legoktm) [17:49:42] (03PS4) 10Legoktm: package_builder: Refactor PHP hook into a template [puppet] - 10https://gerrit.wikimedia.org/r/732098 [17:51:06] (03CR) 10jerkins-bot: [V: 04-1] explicitly declare 2 message dependencies in wikibase.mediainfo.statements [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) (owner: 10AntiCompositeNumber) [17:51:11] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: Disable man-db in pbuilder in package_builder on deneb - https://phabricator.wikimedia.org/T276632 (10Legoktm) 05Open→03Resolved ` ... Setting up man-db (2.8.5-2) ... Not building database; man-db/auto-update is not 'true'. 
` [17:51:34] (03CR) 10Legoktm: [C: 03+2] package_builder: Refactor PHP hook into a template [puppet] - 10https://gerrit.wikimedia.org/r/732098 (owner: 10Legoktm) [17:51:56] (03CR) 10Jbond: "<3 thanks for this, minor nit, will merge and deploy tomorrow" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734350 (https://phabricator.wikimedia.org/T199812) (owner: 10Legoktm) [17:52:32] (03CR) 10Dzahn: "Could you split this into 2 changes please? One just for the file renames and one for disabling puppet? I would like to discuss and merge " [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [17:53:55] (03PS2) 10Legoktm: Update footer links [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734350 (https://phabricator.wikimedia.org/T199812) [17:54:04] (03CR) 10Legoktm: Update footer links (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734350 (https://phabricator.wikimedia.org/T199812) (owner: 10Legoktm) [17:55:02] (03CR) 10Dzahn: [C: 04-1] gitlab: disable puppet and rename files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [17:57:02] (03CR) 10Krinkle: [C: 04-1] "Would it work to assign this kind of override from CommonSettings-labs.php instead?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731803 (https://phabricator.wikimedia.org/T132274) (owner: 10Ottomata) [17:57:29] (03CR) 10Dzahn: [C: 04-1] gitlab: disable puppet and rename files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:00:05] RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T1800). 
[18:00:05] MatmaRex, zabe, and AntiComposite: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] hiii [18:00:16] o/ [18:00:39] o/ [18:04:00] (03CR) 10Legoktm: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732642 (https://phabricator.wikimedia.org/T293996) (owner: 10Giuseppe Lavagetto) [18:04:27] (03PS1) 10Ottomata: Hive - set hive.warehouse.subdir.inherit.perms = false [puppet] - 10https://gerrit.wikimedia.org/r/734368 (https://phabricator.wikimedia.org/T291664) [18:05:07] now we just need a deployer :) [18:06:52] I can deploy if there's no-one else. [18:06:59] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31880/console" [puppet] - 10https://gerrit.wikimedia.org/r/734368 (https://phabricator.wikimedia.org/T291664) (owner: 10Ottomata) [18:07:18] (03PS2) 10Jforrester: Make reply tool available as opt-out on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732254 (https://phabricator.wikimedia.org/T293687) (owner: 10Bartosz Dziewoński) [18:07:22] (03CR) 10Jforrester: [C: 03+2] Make reply tool available as opt-out on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732254 (https://phabricator.wikimedia.org/T293687) (owner: 10Bartosz Dziewoński) [18:07:50] (03CR) 10Legoktm: [C: 03+1] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [18:08:06] legoktm: Should that ip one go out now? 
[18:08:40] (03Merged) 10jenkins-bot: Make reply tool available as opt-out on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732254 (https://phabricator.wikimedia.org/T293687) (owner: 10Bartosz Dziewoński) [18:08:49] (03CR) 10Jforrester: [C: 03+2] explicitly declare 2 message dependencies in wikibase.mediainfo.statements [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) (owner: 10AntiCompositeNumber) [18:09:08] MatmaRex: Live on mwdebug1002; can you check? [18:09:55] looking [18:10:38] James_F: seems good [18:10:55] Cool, syncing now. [18:11:54] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:732254|Make reply tool available as opt-out on frwiki (T293687)]] (duration: 00m 56s) [18:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:05] T293687: Config change: Deploy Reply Tool as opt-out preference at French Wikipedia - https://phabricator.wikimedia.org/T293687 [18:12:16] (03PS2) 10Jforrester: flaggedrevs: Drop legacy wgFlaggedRevsStatsAge config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732836 (owner: 10Zabe) [18:12:32] (03PS3) 10Jforrester: flaggedrevs: Drop legacy wgFlaggedRevsStatsAge config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732836 (owner: 10Zabe) [18:12:36] (03CR) 10Jforrester: [C: 03+2] flaggedrevs: Drop legacy wgFlaggedRevsStatsAge config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732836 (owner: 10Zabe) [18:13:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:17] (03CR) 10Dzahn: [C: 04-1] gitlab: disable puppet and rename files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:13:32] AntiComposite: I've C+2'ed your one and am doing the config ones whilst waiting for CI, just in case you're worried. [18:13:48] (03Merged) 10jenkins-bot: flaggedrevs: Drop legacy wgFlaggedRevsStatsAge config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732836 (owner: 10Zabe) [18:13:56] yup, I saw [18:14:55] (thanks) [18:15:04] MatmaRex: Any time. :-) [18:15:05] (03CR) 10Ahmon Dancy: gitlab: disable puppet and rename files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:15:45] !log jforrester@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:732836|flaggedrevs: Drop legacy wgFlaggedRevsStatsAge config, no longer read]] (duration: 00m 55s) [18:15:47] (03PS4) 10Jforrester: Fix array declaration of NS_USER_TALK abbreviation on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058) (owner: 10Zabe) [18:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:52] (03PS5) 10Jforrester: Fix array declaration of NS_USER_TALK abbreviation on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058) (owner: 10Zabe) [18:16:10] (03CR) 10Jforrester: [C: 03+2] "The sooner this can be made an actual static file the better. :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058) (owner: 10Zabe) [18:16:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:43] (03Merged) 10jenkins-bot: Fix array declaration of NS_USER_TALK abbreviation on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058) (owner: 10Zabe) [18:19:27] (03PS3) 10Jforrester: Fix some easy codestyle issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732971 (owner: 10Zabe) [18:19:51] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:732840|Fix array declaration of NS_USER_TALK abbreviation on ruwikiquote (T197058)]] (duration: 00m 55s) [18:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:57] T197058: Abbreviations for namespaces to ruwikiquote - https://phabricator.wikimedia.org/T197058 [18:20:08] (03CR) 10Jforrester: [C: 03+2] Fix some easy codestyle issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732971 (owner: 10Zabe) [18:21:05] (03Merged) 10jenkins-bot: Fix some easy codestyle issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732971 (owner: 10Zabe) [18:21:52] (03PS3) 10Dzahn: rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) [18:22:11] !log jforrester@deploy1002 Synchronized w/static.php: Config: [[gerrit:732971|Fix some easy codestyle issues]] (duration: 00m 54s) [18:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:26] James_F: I assumed K.rinkle would roll it out whenever he's ready [18:22:47] Ack, will leave. 
[18:23:58] 10SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10RobH) [18:24:17] 10SRE, 10DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (10RobH) 05Open→03Resolved a:03RobH [18:24:45] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:732971|Fix some easy codestyle issues]] (duration: 00m 55s) [18:24:47] I will have a wmf.5 backport once Jenkins finishes in a few minutes, but that can wait until the scheduled patches are done [18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:06] legoktm: Sure; there's a wmf.5 backport going through gate already, just pile it on. [18:25:18] (03PS1) 10Legoktm: Input may be null when rendering a self-closing tag `` [extensions/timeline] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734312 (https://phabricator.wikimedia.org/T294020) [18:25:32] Thanks James_F :) [18:25:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] zabe: Thank you! [18:25:56] PROBLEM - Host db1112 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:02] (03CR) 10Legoktm: [C: 03+2] Input may be null when rendering a self-closing tag `` [extensions/timeline] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734312 (https://phabricator.wikimedia.org/T294020) (owner: 10Legoktm) [18:27:49] * James_F twiddles thumbs. [18:28:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:30:13] hmm [18:30:25] Gotta love the selenium tests. [18:30:27] Could it be the DB or the deploy [18:30:42] The db is an s3 sanitarium master [18:31:05] (03CR) 10jerkins-bot: [V: 04-1] explicitly declare 2 message dependencies in wikibase.mediainfo.statements [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) (owner: 10AntiCompositeNumber) [18:31:22] Unlikely to be the deploy I think? We didn't enable anything scary for s3. [18:31:47] (03CR) 10Ahmon Dancy: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/605343 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [18:31:49] that's not a big difference in latency to alert, and it's "during sync" , right? [18:31:59] mutante: ye [18:32:03] just a few ms more and it already does this [18:32:10] James_F: the DB is completely offline so will be another issue [18:32:18] Just happened the same time as the alert [18:32:19] Ack. [18:33:33] AntiComposite: :-( [18:33:49] looking at the 7 day history I do see some occasional spikes like this [18:33:53] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-11), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10NRodriguez) Hello there @Legoktm we just resolved https://phabricator.wikimedia.org/T290731#745546... 
[18:33:55] (of the latency alert) [18:34:00] (03Merged) 10jenkins-bot: Input may be null when rendering a self-closing tag `` [extensions/timeline] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734312 (https://phabricator.wikimedia.org/T294020) (owner: 10Legoktm) [18:34:39] (03CR) 10Jforrester: [C: 03+2] "Failed GrowthExperiments selenium test again…" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) (owner: 10AntiCompositeNumber) [18:35:01] legoktm: OK for me to sync the EasyTimeline change? [18:35:02] legoktm: ACK, was about to say the same thing. when zooming out it does not look uncommon. Occasionally it happens during the sync [18:35:06] James_F: yep [18:35:52] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:36:07] that was a 15% increase [18:36:10] that made it trigger [18:36:17] call it trigger-happy or not [18:38:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:11] James_F, yeah, Jenkins seems to really hate this patch for some reason, it didn't want to merge it to master the first two times either :( [18:40:05] AntiComposite: Possibly the WBMI hooks make a race condition worse? But eh. 
[18:40:33] !log jforrester@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/timeline/includes/Timeline.php: Backport: [[gerrit:734312|Input may be null when rendering a self-closing tag `` (T294020)]] (duration: 00m 55s) [18:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:41] T294020: TypeError: Argument 2 passed to Shellbox\Command\BoxedCommand::inputFileFromString() must be of the type string, null given, called in /srv/mediawiki/php-1.38.0-wmf.4/extensions/timeline/includes/Timeline.php on line 157 - https://phabricator.wikimedia.org/T294020 [18:41:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:39] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/31881/grafana2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn) [18:43:13] * James_F twiddles thumbs more. [18:45:44] jouncebot: make it happen [18:46:21] James_F: you are aware of this? 18:45 < Reedy> !log Reloading Zuul to deploy [18:46:34] or are these 2 different things [18:47:05] It happens. [18:47:30] The builds are still running on Jenkins. [18:47:41] confirmed the timeline fix [18:47:45] thanks James_F (and MatmaRex!) [18:47:52] Of course. 
[18:50:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:51:11] that is the same kind of about 15% increase as ealier [18:51:50] 210 is baseline and icinga-wm does not like 240 but that is normal during sync..or so [18:52:09] Yeah, but we've not synced for 10 minutes… [18:53:06] It matches against https://sal.toolforge.org/log/nJyuuHwB8Fs0LHO5IJ3U but if anything that should have improved performance if anything (fixing a typo). [18:54:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:55:16] it only spikes on some alerts [18:55:35] er, spikes on some syncs* [18:56:01] yea, confirmed [18:56:02] It's more that it's gone from 210 to 240 at 18:20 and essentially stayed there. [18:56:16] Shouldn't it have reverted to mean by now? 
[18:56:42] there's a small corresponding dip in php-fpm workers https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=54&orgId=1 [18:57:05] * legoktm peeks at slow log [18:57:44] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10RobH) [18:58:18] legoktm: we have a db down right before the alert [18:58:24] Could things be out of balance [18:58:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:58:26] https://phabricator.wikimedia.org/T294295 [18:58:53] That should if anything make prod faster, I'd have thought? [18:59:14] oh wtf [18:59:31] that's the s3 contribs/rc db replica [18:59:33] James_F: my understanding of DBs is not great [18:59:42] legoktm: also sanitarium [18:59:45] you said saniatrium tough [18:59:47] though [18:59:49] Oh right, it's used as a replica as well as sanitarium? [18:59:52] that would not be prod [18:59:59] I assumed those were non-prod, right. [19:00:20] let's check dbtree.wikimedia.org [19:00:25] https://noc.wikimedia.org/dbconfig/eqiad.json [19:00:35] (03CR) 10jerkins-bot: [V: 04-1] explicitly declare 2 message dependencies in wikibase.mediainfo.statements [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) (owner: 10AntiCompositeNumber) [19:00:35] I went from https://github.com/wikimedia/puppet/blob/833ea5c0a697884ff55e2e4f0cf2961ad1fad207/hieradata/hosts/db1112.yaml#L2 [19:01:02] which means these slow queries are going to be sent to the normal replicas [19:01:13] Right. No wonder it's slow. 
[19:01:33] should we try a reboot via mgmt? [19:01:33] it's the master for the clouddb slaves [19:01:36] rc_logid again, that's twice in a row [19:01:52] yes, I can do that [19:02:03] AntiComposite: At this point I think we should just give up and have it ride the train, sorry. [19:02:10] actually we are supposed to follow the wikitech link...checking [19:02:18] James_F, Yeah, agreed [19:02:38] legoktm: trying mgmt login [19:02:59] (03Abandoned) 10Jforrester: explicitly declare 2 message dependencies in wikibase.mediainfo.statements [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734310 (https://phabricator.wikimedia.org/T286297) (owner: 10AntiCompositeNumber) [19:03:19] James_F, thanks for your help [19:03:28] let me depool it [19:03:31] AntiComposite: Of course. Thank you for working to fix things. [19:04:00] legoktm: great, please do, I was going to look for the wikitech alert link that talked about them coming back with stopped service [19:04:24] am on console, no output [19:04:35] hey folks [19:04:36] !log legoktm@cumin1001 dbctl commit (dc=all): 'Depool db1112 (T294295)', diff saved to https://phabricator.wikimedia.org/P17596 and previous config saved to /var/cache/conftool/dbconfig/20211025-190436-legoktm.json [19:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:43] T294295: db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 [19:05:03] cant get out of console..arr [19:05:10] hi kormat, db1112 went down about ~35 min ago [19:05:11] now..performing power cycle [19:05:24] mutante, legoktm: i'll move those groups to a different db instance [19:05:49] kormat: should I try to powercycle, still? 
[19:06:19] mutante: sure, why not [19:06:29] !log db1112 - powercycling [19:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily move mw groups to db1123 T294295', diff saved to https://phabricator.wikimedia.org/P17597 and previous config saved to /var/cache/conftool/dbconfig/20211025-190717-kormat.json [19:07:22] watching console, BIOS messages [19:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:25] latency has mostly recovered [19:07:43] it's also the sanitarium master for s3 :/ [19:08:17] yes, confirmed that in dbtree, master for clouddb [19:08:32] it's coming back..so far [19:08:48] RECOVERY - Host db1112 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:08:58] kormat: try ssh, it's back at login [19:09:20] with a.. new key...? [19:10:15] ssh host key was fine for me [19:10:16] I mean, I haven't logged in there in a long time or never, I just got the warning about offending key [19:10:19] ok [19:10:41] mutante: while you're on the console, any hw errors? [19:11:11] PROBLEM - MariaDB Replica SQL: s3 #page on db1112 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:11:18] PROBLEM - MariaDB read only s3 on db1112 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [19:11:37] * legoktm acks the page [19:11:37] 👋 [19:11:38] kormat: the boot messages looked all green and normal, it did an fsck, then continued normal. checking the DRAC level ow [19:11:38] why does that page but "host down" does not [19:11:54] 👋 [19:11:55] <_joe_> majavah: not relevant now [19:11:56] scrolling back, lmk if you need anything <3 [19:11:59] hey [19:12:00] <_joe_> should we depool the server? 
[19:12:07] _joe_: already done [19:12:08] <_joe_> oh alreayd done [19:12:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1112.eqiad.wmnet with reason: hardware fail [19:12:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1112.eqiad.wmnet with reason: hardware fail [19:12:15] downtimed [19:12:16] <_joe_> yeah sorry I am reading scrollback right now [19:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:18] _joe_: done [19:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:25] i've just put in a downtime for 24h [19:12:28] (well, 26h) [19:12:34] we should have downtimed befored rebooting, sorry, but it's back [19:12:39] same [19:13:07] <_joe_> majavah: we don't page on the host down because that's a very broad alert; we page on services being unavailable [19:13:48] ACKNOWLEDGEMENT - MariaDB Replica IO: s3 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1112.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1112.eqiad.wmnet (110 Connection timed out) Kormat db1112 is down https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:13:48] ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2979.91 seconds Kormat db1112 is down https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:13:55] acking the checks on .. yeah that ^ [19:15:58] so I ran "racadm getsel" for hw troubleshooting [19:16:11] and there is a log line that is like broken RAM [19:16:17] but from 2 days ago [19:16:29] and called Non-Critical. 
and that's it [19:16:54] "Correctable memory error rate exceeded for DIMM_B1" [19:17:05] * jbond here [19:17:07] this type shows up on tickets where we end up replacing RAM though, fwiw [19:17:21] jbond: db went dow [19:17:23] N [19:17:28] jbond: it's handled, server broke but depooled [19:17:33] Alarms eventually went off [19:17:36] ack thx [19:18:22] alright. give that the server is depooled, and wikireplicas are non-critical, i'd suggest leaving it as-is, and i'll dig into it tomorrow [19:18:26] well, actually, it's simply back after powercycling it [19:18:36] 10ops-codfw, 10DC-Ops, 10Wikidata-Query-Service: (Need By: TBD) rack/setup/install wdqs20[09,10,11] - https://phabricator.wikimedia.org/T294297 (10RobH) [19:18:50] and the log is not really conclusive but Dell might still say that they replace the DIMM [19:19:15] sounds good, kormat [19:19:20] mutante: 👍 [19:19:20] 10ops-codfw, 10DC-Ops, 10Wikidata-Query-Service: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11] - https://phabricator.wikimedia.org/T294297 (10RobH) [19:19:27] Should we send something to cloud-announce to let them know it's expected? [19:19:41] 10ops-codfw, 10DC-Ops, 10Wikidata-Query-Service: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11] - https://phabricator.wikimedia.org/T294297 (10RobH) [19:19:55] 10ops-codfw, 10DC-Ops, 10Wikidata-Query-Service: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11] - https://phabricator.wikimedia.org/T294297 (10RobH) a:03Papaul [19:20:34] was about to say I can make a ticket.. but there is: [19:20:35] https://phabricator.wikimedia.org/T294295 [19:20:48] so we can use that for comments [19:21:07] public UBN.. we should say something, would be good [19:21:09] thanks kormat [19:22:17] dropped the prio [19:23:04] Spookreeeno: sounds like a wmcs call, to me. i've no idea what norms they have these days. i know with the not-so-long-ago labsdb setup, they'd be lagging for days at a time. 
v0v [19:23:35] cool, thanks, commenting as well [19:24:34] yea, seems from here I would say hand-over to -wmcs / -cloud [19:25:14] cloud team is not around today [19:26:12] I don't think it needs an announcement if we expect it will be fixed tomorrow, while much rarer these days replag is still a thing that is just expected to exist sometimes [19:26:47] yup, I wouldn't expect more than a phab task [19:27:14] PROBLEM - MariaDB Replica Lag: s3 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3799.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:27:34] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3818.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:27:34] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3819.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:28:33] ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3818.71 seconds Legoktm s3 replication broken - T294295 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:28:33] ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3819.38 seconds Legoktm s3 replication broken - T294295 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:28:33] ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3799.12 seconds Legoktm s3 replication broken - T294295 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:29:12] so far it looks to become one of those "it just shut down and now it's back up" cases we have every once in a while. where it's under the threshold for it to get hw replacement. 
and we probably could get it back into prod..or insist on the Dell debug procedure, but we'll see what DBA thinks [19:32:11] 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10RobH) [19:32:21] 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10RobH) [19:33:15] 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10RobH) a:03Papaul [19:33:30] sorry but I'm out without my laptop at the moment [19:33:31] (03PS1) 10Urbanecm: Set default two-letter NS_PROJECT aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) [19:33:40] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) [19:33:42] just got the page [19:34:00] RECOVERY - MariaDB read only s3 on db1112 is OK: Version 10.4.18-MariaDB-log, Uptime 128s, read_only: True, event_scheduler: True, 11.71 QPS, connection latency: 0.004597s, query latency: 0.000582s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [19:34:28] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @Papaul Afraid this is a long story. just saw `mw2255.mgmt` alerting in Ic... 
[19:35:28] ACKNOWLEDGEMENT - SSH on mw2255.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:35:45] volans: don't worry, it's already over basically [19:35:52] mutante: ack thx [19:36:00] a DB server died and has been depooled, it will continue tomorrow [19:36:13] it affects cloud [19:36:38] page was delayed [19:40:39] !log cumin2002 - sudo systemctl reset-failed to clear Icinga alert about failed but (now) non-existing service database-backups-snapshots.service, assuming it's a case of "only in active DC" [19:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:51] (03CR) 10Zabe: Set default two-letter NS_PROJECT aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [19:42:10] (03CR) 10Urbanecm: Set default two-letter NS_PROJECT aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [19:42:34] !log icinga - ACKing all unhandled CRIT alerts on hosts with "dev" or "test" in their name, regardless of notifications being disabled or not. 
just so that we get more signal than noise in actual unhandled CRITs in web UI [19:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:03] ACKNOWLEDGEMENT - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service daniel_zahn dev/test https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:16] (03CR) 10Zabe: [C: 03+1] Set default two-letter NS_PROJECT aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [19:45:18] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2255.codfw.wmnet [19:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:34] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2255.codfw.wmnet [19:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:04] !log mw2255 - depooled=inactive (incl "dsh groups"), shut down physically for T283582 - can be worked on anytime [19:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:11] T283582: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 [19:47:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2255.codfw.wmnet with reason: DRAC upgrade [19:47:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2255.codfw.wmnet with reason: DRAC upgrade [19:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T2000). [20:00:42] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10Toolforge, 10Patch-For-Review: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (10Dzahn) This is needed for 7.4 now < James_F> We're depending on php-imagick which doesn't... [20:01:41] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-12), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) [20:02:26] 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Dzahn) Server is currently back up and sitting at login. Forwarding the DIMM replacement question to ops-eqiad. adding tag. [20:03:01] 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Kanban): db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down) - https://phabricator.wikimedia.org/T294295 (10Dzahn) [20:03:16] 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Kanban): db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down) - https://phabricator.wikimedia.org/T294295 (10Dzahn) [20:04:55] 10SRE, 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q2): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10herron) >>! In T202061#7176114, @CDanis wrote: > I was thinking we would have a status.wikimedia.org that serves a HTTP 302 to the o... 
[20:08:54] 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10RobH) [20:10:46] 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10RobH) [20:11:02] 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10RobH) a:03Jclark-ctr [20:11:47] (03PS1) 10Herron: exim: aggressively retry messages to alert.victorops.com addresses [puppet] - 10https://gerrit.wikimedia.org/r/734391 (https://phabricator.wikimedia.org/T294166) [20:18:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, and 2 others: Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10herron) [20:20:42] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 24.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:21:36] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) We needed `ast` and `imagick` for CI, so I've uploaded `php7.4-` versions of those too. [20:24:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:28:18] 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10RobH) [20:28:46] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (10herron) Another thought is something like https://github.com/justwatchcom/sql_exporter to return results of a SQL quer... [20:29:07] 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10RobH) a:03fgiunchedi @fgiunchedi, I had to make some assumptions in the racking details section of this task. Can you please review it all, co... [20:29:31] 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10RobH) [20:49:04] (03PS1) 10Legoktm: admin: Update my (legoktm's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/734400 [20:50:31] (03CR) 10Legoktm: [C: 03+2] admin: Update my (legoktm's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/734400 (owner: 10Legoktm) [20:54:52] (03PS1) 10Herron: rsyslog: centralize remote_server_tls lookups into single location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) [20:57:00] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10colewhite) >>! In T288851#7454801, @Joe wrote: > * What topic should I use on kafka? We talked offline a bit. Although I could not find it in t... 
[20:57:33] (03PS2) 10Herron: rsyslog: centralize remote_server_tls lookups into single location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) [20:57:41] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [20:59:06] (03PS3) 10Herron: rsyslog: centralize remote_syslog_tls lookups into single location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) [21:00:04] Reedy and sbassett: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T2100). [21:02:05] (03PS1) 10Herron: rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) [21:02:59] (03PS2) 10Herron: rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) [21:03:13] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [21:08:59] (03CR) 10Herron: "this one should be an effective noop" [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [21:09:31] (03PS3) 10Herron: rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) [21:10:51] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [21:12:27] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops: schedule downtime for contint2001 - 
https://phabricator.wikimedia.org/T294271 (10Dzahn) also see T256422 - switch contint prod server back from contint2001 to contint1001 [21:12:33] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) Case open with Juniper case #: 2021-1025-348302 [21:18:14] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Ottomata) As we make these decisions, I'd love if we could keep {T291645} in mind. > What topic should I use on kafka? I support a separate t... [21:19:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) Hi @effeietsanders, I noticed in the log you posted it says " as 'lgelauff'". But your user name here is `effeietsanders`. So it seems like y... [21:19:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) 05Resolved→03Open [21:22:34] (03CR) 10Dzahn: [C: 03+2] "compiled on a bunch of hosts using quickdatacopy and all noop: https://puppet-compiler.wmflabs.org/compiler1001/31889/" [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn) [21:24:54] (03CR) 10Jdlrobson: [C: 04-1] "Please make this change in a single patch. Merge into https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/734361" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734359 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [21:25:13] (03CR) 10Jdlrobson: [C: 04-1] "Please make this change in a single patch rather than 2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [21:28:24] (03PS1) 10Dzahn: grafana::production: exclude grafana.db-journal from rsync [puppet] - 10https://gerrit.wikimedia.org/r/734408 (https://phabricator.wikimedia.org/T294080) [21:34:09] (03CR) 10Dzahn: "can't compile yet because the new VMs are not known in compiler yet so we need to sync facts.. but yea.. this should work now" [puppet] - 10https://gerrit.wikimedia.org/r/734408 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn) [21:37:59] (03CR) 10Dzahn: [C: 03+2] "pwn'ed" [dns] - 10https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) (owner: 10Gerrit maintenance bot) [21:38:10] (03CR) 10Dzahn: [C: 03+2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Paiwan" [dns] - 10https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) (owner: 10Gerrit maintenance bot) [21:38:55] (03PS3) 10Dzahn: Add pwn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) (owner: 10Gerrit maintenance bot) [21:40:55] !log authdns1001 (DNS) - sudo authdns-update, add new project language "pwn" (Paiwan) for T292415 [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:02] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415 [21:44:42] (03CR) 10Dzahn: "pwn.wikipedia.org is an alias for dyna.wikimedia.org." [dns] - 10https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) (owner: 10Gerrit maintenance bot) [21:45:27] mutante: thanks! 
https://gerrit.wikimedia.org/r/c/operations/dns/+/733141 would be appreciated too 🙂 [21:46:06] yes yes, I am aware [21:46:07] (03PS3) 10Urbanecm: Add ami to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/733141 (https://phabricator.wikimedia.org/T292414) (owner: 10Gerrit maintenance bot) [21:46:22] ah, okay, then I'll stop interrupting :) [21:47:12] (03CR) 10Dzahn: [C: 03+2] "https://en.wikipedia.org/wiki/Amis_language" [dns] - 10https://gerrit.wikimedia.org/r/733141 (https://phabricator.wikimedia.org/T292414) (owner: 10Gerrit maintenance bot) [21:48:25] (03PS5) 10Ebernhardson: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) [21:48:48] (03PS6) 10Ryan Kemper: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [21:49:30] (03CR) 10jerkins-bot: [V: 04-1] query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [21:50:26] (03CR) 10Cwhite: [C: 03+2] role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [21:52:09] !log new project language "ami" added - Sowal no 'Amis is the Formosan language of the 'Amis (or Ami), an indigenous people living along the east coast of Taiwan. - T292414 [21:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:16] T292414: Create Wikipedia Amis - https://phabricator.wikimedia.org/T292414 [21:53:44] !log new project language "pwn" added - Paiwan is a native language of Taiwan, spoken by the Paiwan, a Taiwanese indigenous people. 
T292415
[21:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:51] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415
[21:54:49] (CR) Dzahn: "ami.wikipedia.org is an alias for dyna.wikimedia.org." [dns] - https://gerrit.wikimedia.org/r/733141 (https://phabricator.wikimedia.org/T292414) (owner: Gerrit maintenance bot)
[21:56:15] (CR) Cwhite: [C: +2] profile: add beta logstash profile [puppet] - https://gerrit.wikimedia.org/r/727627 (https://phabricator.wikimedia.org/T288618) (owner: Cwhite)
[21:57:33] (PS7) Ebernhardson: query_service: Add new oauth related configuration [puppet] - https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006)
[22:06:38] (PS1) Ryan Kemper: wcqs: add dummy oauth_access_token_secret [labs/private] - https://gerrit.wikimedia.org/r/734418 (https://phabricator.wikimedia.org/T280006)
[22:08:43] ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (RobH)
[22:08:53] ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (RobH)
[22:13:00] (CR) Ebernhardson: [C: +1] wcqs: add dummy oauth_access_token_secret [labs/private] - https://gerrit.wikimedia.org/r/734418 (https://phabricator.wikimedia.org/T280006) (owner: Ryan Kemper)
[22:14:55] (CR) Ryan Kemper: [V: +2 C: +2] wcqs: add dummy oauth_access_token_secret [labs/private] - https://gerrit.wikimedia.org/r/734418 (https://phabricator.wikimedia.org/T280006) (owner: Ryan Kemper)
[22:16:33] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31895/console" [puppet] - https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: Ebernhardson)
[22:16:36] (PS8) Cwhite: hiera: add minimal logstash-beta-next hiera configuration [puppet] - https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618)
[22:17:54] (CR) Cwhite: [C: +2] hiera: add minimal logstash-beta-next hiera configuration [puppet] - https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: Cwhite)
[22:18:43] (CR) Ryan Kemper: [V: +1 C: +2] query_service: Add new oauth related configuration [puppet] - https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: Ebernhardson)
[22:27:31] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@13448f1] (wcqs): Deploy 0.3.90 to WCQS
[22:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:45] (PS1) Legoktm: Disable DPL on Wikibooks where not in use [mediawiki-config] - https://gerrit.wikimedia.org/r/734421 (https://phabricator.wikimedia.org/T287916)
[22:29:47] (PS1) Legoktm: Disable DPL on Wikinews where not in use [mediawiki-config] - https://gerrit.wikimedia.org/r/734422 (https://phabricator.wikimedia.org/T287916)
[22:29:49] (PS1) Legoktm: Disable DPL on Wikiquotes where not in use [mediawiki-config] - https://gerrit.wikimedia.org/r/734423 (https://phabricator.wikimedia.org/T287916)
[22:29:51] (PS1) Legoktm: Disable DPL on Wikisources where not in use [mediawiki-config] - https://gerrit.wikimedia.org/r/734424 (https://phabricator.wikimedia.org/T287916)
[22:29:53] (PS1) Legoktm: Disable DPL on Wikiversities where not in use [mediawiki-config] - https://gerrit.wikimedia.org/r/734425 (https://phabricator.wikimedia.org/T287916)
[22:29:55] (PS1) Legoktm: Disable DPL on opt-in wikis where not in use [mediawiki-config] - https://gerrit.wikimedia.org/r/734426 (https://phabricator.wikimedia.org/T287916)
[22:30:36] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@13448f1] (wcqs): Deploy 0.3.90 to WCQS (duration: 03m 04s)
[22:30:39] thanks for working on it legoktm :)
[22:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:19] ^.^
[22:42:10] (PS1) Zabe: Optimize astwiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/734427
[22:42:30] (Abandoned) Zabe: Optimize astwiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/734427 (owner: Zabe)
[22:44:24] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@e908052] (wcqs): Deploy 0.3.90 to WCQS
[22:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:08] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:49:10] (PS1) Zabe: test [mediawiki-config] - https://gerrit.wikimedia.org/r/734428
[22:49:26] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:49:52] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:53:20] !log uploaded PHP 7.4.25 to apt.wm.o (DSA-4992-1)
[22:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:22] !log [wcqs] Downtimed `wcqs*` until roughly a week from now (while we setup oauth)
[22:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:05] RoanKattouw and Urbanecm: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211025T2300).
[23:00:05] Juan_90264: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:27] (PS1) Zabe: Fix HD logo size in some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/734429
[23:01:06] (CR) Zabe: [C: -1] "logos are partly wrong" [mediawiki-config] - https://gerrit.wikimedia.org/r/734429 (owner: Zabe)
[23:01:33] Hello
[23:01:47] Hello!
[23:01:59] I'll do the deployment today
[23:02:26] Perfect
[23:02:41] (PS4) Catrope: Create an alias for the Appendix and Appendix_talk namespace on mywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733120 (https://phabricator.wikimedia.org/T291146) (owner: Juan90264)
[23:02:45] (CR) Catrope: [C: +2] Create an alias for the Appendix and Appendix_talk namespace on mywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733120 (https://phabricator.wikimedia.org/T291146) (owner: Juan90264)
[23:03:54] (Merged) jenkins-bot: Create an alias for the Appendix and Appendix_talk namespace on mywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733120 (https://phabricator.wikimedia.org/T291146) (owner: Juan90264)
[23:04:29] Great
[23:06:53] Juan_90264: Your patch is on mwdebug1002, please test there
[23:07:04] Okay
[23:07:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:40] RoanKattouw: I tested and approved
[23:10:45] Alright, deploying
[23:10:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:13] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Create alias for Appendix and Appendix_talk namespaces on mywiktionary (T291146) (duration: 00m 55s)
[23:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:19] T291146: Appendix: namespace did not automictically localize on mywikt - https://phabricator.wikimedia.org/T291146
[23:12:46] And that was the only patch, so we're done!
[23:13:37] (PS4) Zabe: Fix HD logo size in some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731)
[23:15:53] It's working, thanks Roan
[23:21:49] SRE, MW-on-K8s, serviceops, MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), and 2 others: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) MultiHttpClient is more complicated than the previous part since it's in includes/libs/ and isn't supposed...
[23:27:18] (PS1) Zabe: Renaming $wmfEtcdLastModifiedIndex to $wmgEtcdLastModifiedIndex [mediawiki-config] - https://gerrit.wikimedia.org/r/734431 (https://phabricator.wikimedia.org/T45956)