[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T0000).
[00:12:47] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[00:14:21] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:14:45] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:16:15] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:39:29] SRE, Services, Toolhub, serviceops, and 2 others: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (bd808)
[01:01:11] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:56:04] SRE, Performance-Team, serviceops: Evaluate using igbinary for MW php-apcu at WMF (apc.serializer) - https://phabricator.wikimedia.org/T225074 (Krinkle)
[01:56:29] SRE, Performance-Team, serviceops: Evaluate using igbinary for MW php-apcu at WMF - https://phabricator.wikimedia.org/T225074 (Krinkle)
[01:58:03] (PS6) Acamicamacaraca: Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024)
[01:58:12] (PS7) Acamicamacaraca: Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024)
[01:58:56] SRE, Performance-Team, serviceops: Evaluate using igbinary for MW php-apcu at WMF - https://phabricator.wikimedia.org/T225074 (Krinkle) Looking around on "the Internet", sources say igbinary is by default slower than php by default, but when setting `igbinary.compact_strings=Off` it should be faster....
[02:58:59] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[04:06:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:07:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:12:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:21:26] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.81`. Pre-deploy tests passing on canary `wdqs1003`
[04:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:49] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@9d03aaa]: 0.3.81
[04:23:55] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:26:31] !log [WDQS Deploy] Tests passing following deploy of `0.3.81` on canary `wdqs1003`; proceeding to rest of fleet
[04:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:36:43] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:40:52] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@9d03aaa]: 0.3.81 (duration: 17m 03s)
[04:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:05] !log [WDQS] `wdqs2004`'s disk is full due to overinflated `wikidata.jnl`, nuking and depooling: `sudo rm -fv /srv/wdqs/wikidata.jnl && sudo depool`
[04:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:26] !log [WDQS Deploy] Re-rolling deploy so that `wdqs2004` gets deployed to
[04:41:29] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@9d03aaa]: 0.3.81
[04:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:03] (PS6) Marostegui: mariadb: Promote db1107 to m3 master. [puppet] - https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197)
[04:42:23] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2004 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:42:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:23] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:43:36] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@9d03aaa]: 0.3.81 (duration: 02m 07s)
[04:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:55] RECOVERY - Disk space on wdqs2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2004&var-datasource=codfw+prometheus/ops
[04:44:15] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:44:43] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[04:44:45] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[04:44:48] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[04:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:15] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:27] PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 2.592e+06 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[04:48:48] (CR) Marostegui: [C: +2] mariadb: Promote db1107 to m3 master. [puppet] - https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) (owner: Marostegui)
[04:57:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:59:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:10:33] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:14:13] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good
[05:14:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[05:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:10] !log [WDQS] `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2005.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh wikidata journal after nuking wdqs2004's" --blazegraph_instance blazegraph`
[05:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:31] RECOVERY - snapshot of s4 in eqiad on alert1001 is OK: Last snapshot for s4 at eqiad (db1139.eqiad.wmnet:3314) taken on 2021-08-12 03:33:35 (1485 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:16:00] (PS1) Marostegui: db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/711951 (https://phabricator.wikimedia.org/T288197)
[05:16:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:00] (PS1) Marostegui: Revert "db2107: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/711710
[05:21:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:22:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:31] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:41:46] (PS1) Tim Starling: Make ad-hoc logging handle election not being set [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711711 (https://phabricator.wikimedia.org/T288711)
[05:50:35] We'll put phabricator in RO for a few minutes in 10 minutes
[05:54:27] (PS2) Giuseppe Lavagetto: deploy-mwdebug: introduce interactive mode [puppet] - https://gerrit.wikimedia.org/r/711506
[06:00:05] marostegui and kormat: #bothumor My software never has bugs. It just develops random features. Rise for m3 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T0600).
[06:00:14] kormat: I am going to go ahead
[06:00:27] !log Failover m3 from db1132 to db1107 - T288197
[06:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:35] T288197: Failover m3 (phabricator) master (db1132) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T288197
[06:01:39] all done
[06:01:50] \o/
[06:02:35] Phabricator works just fine for me
[06:02:42] thanks moritzm!
[06:06:41] (CR) Marostegui: [C: +2] db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/711951 (https://phabricator.wikimedia.org/T288197) (owner: Marostegui)
[06:17:05] marostegui: 😅
[06:20:47] (CR) Elukey: "modules/profile/templates/cumin/aliases.yaml.erb:druid-canary: P{druid1003.eqiad.wmnet}" [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148) (owner: Btullis)
[06:23:41] (CR) Tim Starling: [C: +2] Make ad-hoc logging handle election not being set [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711711 (https://phabricator.wikimedia.org/T288711) (owner: Tim Starling)
[06:27:52] (Merged) jenkins-bot: Make ad-hoc logging handle election not being set [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711711 (https://phabricator.wikimedia.org/T288711) (owner: Tim Starling)
[06:31:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:59] !log installing c-ares security updates on Bullseye
[06:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:23] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 5295 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:47:01] !log updating bullseye installations to the latest state of testing
[06:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:09] RECOVERY - WDQS high update lag on wdqs2004 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 5401 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:49:28] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SecurePoll/includes/Crypt/GpgCrypt.php: fix for T288711 failure of election creation (duration: 01m 09s)
[06:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:36] T288711: /wiki/Special:SecurePoll/create Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T288711
[06:52:37] SRE, Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (MoritzMuehlenhoff) >>! In T275873#7216558, @MoritzMuehlenhoff wrote: >>>! In T275873#7215311, @fgiunchedi wrote: >>> This is tracked by upstream at https://github.com/prometheus/node_e...
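(Reader's note on the WDQS update-lag alerts above.) The alert text `2.592e+06 ge 4.32e+04` reads as "current value ≥ critical threshold", and the recovery text `(C)4.32e+04 ge (W)2.16e+04 ge 5401` as "critical ≥ warning ≥ current value". The log does not state the unit; the sketch below assumes seconds, which the round thresholds (3600, 4.32e+04) suggest but do not prove. The helper names `humanize` and `classify` are made up for illustration and are not part of Icinga or any WMF tooling.

```python
from datetime import timedelta


def humanize(seconds: float) -> str:
    """Render a lag value (ASSUMED to be seconds) as a readable duration."""
    return str(timedelta(seconds=round(seconds)))


def classify(value: float, warning: float, critical: float) -> str:
    """Mimic the 'ge' (greater-or-equal) comparison seen in the alert text."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Values copied from the log lines for wdqs2004.
print(humanize(2.592e6), classify(2.592e6, 2.16e4, 4.32e4))
print(humanize(5401), classify(5401, 2.16e4, 4.32e4))
```

Under that assumption, the 04:45:27 alert on wdqs2004 corresponds to a reported lag of 30 days right after its journal was deleted, against a 12-hour critical threshold.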
[06:56:20] (CR) Marostegui: [C: +2] Revert "db2107: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/711710 (owner: Marostegui)
[06:58:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P17005 and previous config saved to /var/cache/conftool/dbconfig/20210812-065833-root.json
[06:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:24] SRE, MW-on-K8s, serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (Legoktm) >>! In T261277#7245477, @Legoktm wrote: >> ...given there could be several dozens of such very small services > > I [[https://logstash.wikimedia.org/got...
[07:00:02] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:02:50] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:03:06] that was me ^
[07:05:22] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:05:56] (CR) Filippo Giunchedi: [C: +1] pontoon: move traffic stack kafka settings to separate file [puppet] - https://gerrit.wikimedia.org/r/711558 (owner: Ema)
[07:08:16] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:10:14] (CR) Filippo Giunchedi: "LGTM, I think we should also adapt all omkafka usages, namely:" [puppet] - https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) (owner: Cwhite)
[07:13:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
[07:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P17006 and previous config saved to /var/cache/conftool/dbconfig/20210812-071337-root.json
[07:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
[07:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet
[07:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet
[07:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org
[07:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:14] (CR) Majavah: [C: -1] Add toolhub to LVS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) (owner: Legoktm)
[07:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org
[07:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:54] !log temp upgrade thanos to 0.22.0 on thanos-fe2001 to help debug a potential upstream issue
[07:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:28] (CR) Ema: "One more nit!" [alerts] - https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[07:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P17007 and previous config saved to /var/cache/conftool/dbconfig/20210812-072841-root.json
[07:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1004.wikimedia.org
[07:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1004.wikimedia.org
[07:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org
[07:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org
[07:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:19] (CR) Ayounsi: [C: +1] Propose a format for profile contact data [puppet] - https://gerrit.wikimedia.org/r/711400 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[07:38:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org
[07:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org
[07:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:17] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 953.6 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:43:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 15%: After reimage', diff saved to https://phabricator.wikimedia.org/P17008 and previous config saved to /var/cache/conftool/dbconfig/20210812-074344-root.json
[07:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet
[07:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:01] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.524e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[07:48:39] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[07:50:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={thanos-compact,thanos-rule} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:52:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet
[07:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:37] (PS4) MMandere: Traffic: Add varnish prometheus exporter alert [alerts] - https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660)
[07:53:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[07:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:39] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.524e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[07:58:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[07:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 20%: After reimage', diff saved to https://phabricator.wikimedia.org/P17009 and previous config saved to /var/cache/conftool/dbconfig/20210812-075848-root.json
[07:58:51] (CR) Ema: [C: +1] Traffic: Add varnish prometheus exporter alert [alerts] - https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[07:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:57] (CR) Ayounsi: [C: +1] "One of the longer term goal is to be able to open tasks for the relevant persons. But one could imagine having the matching Phabricator us" [puppet] - https://gerrit.wikimedia.org/r/711400 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[08:00:00] (CR) Ema: [C: +2] pontoon: move traffic stack kafka settings to separate file [puppet] - https://gerrit.wikimedia.org/r/711558 (owner: Ema)
[08:02:08] (PS1) Marostegui: production-m5.sql: Add mailman grants from dbproxies [puppet] - https://gerrit.wikimedia.org/r/712085 (https://phabricator.wikimedia.org/T288093)
[08:03:40] (CR) Marostegui: [C: +2] production-m5.sql: Add mailman grants from dbproxies [puppet] - https://gerrit.wikimedia.org/r/712085 (https://phabricator.wikimedia.org/T288093) (owner: Marostegui)
[08:07:58] (CR) Filippo Giunchedi: [C: +1] Traffic: Add varnish prometheus exporter alert [alerts] - https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[08:13:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 30%: After reimage', diff saved to https://phabricator.wikimedia.org/P17010 and previous config saved to /var/cache/conftool/dbconfig/20210812-081351-root.json
[08:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:08] (CR) Giuseppe Lavagetto: [C: +2] deploy-mwdebug: introduce interactive mode [puppet] - https://gerrit.wikimedia.org/r/711506 (owner: Giuseppe Lavagetto)
[08:14:24] (CR) MMandere: [C: +2] Traffic: Add varnish prometheus exporter alert [alerts] - https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[08:18:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet
[08:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet
[08:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:39] (PS1) Filippo Giunchedi: pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098
[08:26:41] (PS1) Filippo Giunchedi: pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099
[08:26:43] (PS1) Filippo Giunchedi: pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100
[08:26:56] is there a way to make API requests in production, or get a PHP shell, where a wiki runs from a different train version than the current one?
[08:27:14] (CR) jerkins-bot: [V: -1] pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098 (owner: Filippo Giunchedi)
[08:27:20] (specifically I’d like to test wikidatawiki behavior on wmf.17 and it’s currently on wmf.18)
[08:28:40] (CR) jerkins-bot: [V: -1] pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099 (owner: Filippo Giunchedi)
[08:28:42] in theory you could manually adjust the wikiversions file on a mwdebug server, and revert when done
[08:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 40%: After reimage', diff saved to https://phabricator.wikimedia.org/P17011 and previous config saved to /var/cache/conftool/dbconfig/20210812-082855-root.json
[08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:33] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[08:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:51] i might try that…
[08:32:00] `sudo -u mwdeploy sed -i '/\bwikidatawiki\b/ s/18/17/' /srv/mediawiki/wikiversions.{json,php}`, and reset with `scap pull`, I guess
[08:32:22] (CR) Kormat: [C: +1] pontoon: move hiera files to 'settings' [puppet] - https://gerrit.wikimedia.org/r/711507 (owner: Filippo Giunchedi)
[08:32:27] yeah I’ll quickly try that on mwdebug2001
[08:32:35] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (ayounsi) The easiest is to ping @cmooney or myself on IRC, or schedule it longer in advance. Same day 6:44pm Phabricator ping is not ideal ;)
[08:33:57] alright, I’m done, mwdebug2001 should be back to normal
[08:34:01] thanks majavah
[08:34:06] (PS1) David Caro: prometheus.icinga-exporter: use caps for ceph too [puppet] - https://gerrit.wikimedia.org/r/712104
[08:34:22] (PS2) Filippo Giunchedi: pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098
[08:34:24] (PS2) Filippo Giunchedi: pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099
[08:34:26] (PS2) Filippo Giunchedi: pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100
[08:37:33] (CR) Filippo Giunchedi: [C: +1] prometheus.icinga-exporter: use caps for ceph too [puppet] - https://gerrit.wikimedia.org/r/712104 (owner: David Caro)
[08:37:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet
[08:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:50] (CR) Filippo Giunchedi: [C: +2] pontoon: move hiera files to 'settings' [puppet] - https://gerrit.wikimedia.org/r/711507 (owner: Filippo Giunchedi)
[08:38:53] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2002.codfw.wmnet
[08:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet
[08:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:32] (PS3) Filippo Giunchedi: pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098
[08:41:34] (PS3) Filippo Giunchedi: pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099
[08:41:36] (PS3) Filippo Giunchedi: pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100
[08:43:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org
[08:43:38] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet
[08:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:46] (PS1) David Caro: openstack.pdns: increase the number of tcp connections [puppet] - https://gerrit.wikimedia.org/r/712114 (https://phabricator.wikimedia.org/T288725)
[08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P17012 and previous config saved to /var/cache/conftool/dbconfig/20210812-084359-root.json
[08:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:28] jouncebot: now
[08:45:28] No deployments scheduled for the next 1 hour(s) and 14 minute(s)
[08:45:33] (PS1) Kormat: ProductionServices: Add new pc hosts. [mediawiki-config] - https://gerrit.wikimedia.org/r/712115 (https://phabricator.wikimedia.org/T284825)
[08:45:43] ok, then I’ll probably backport the fix for T288724 soonish (though it’ll take a while to make it through CI)
[08:45:43] T288724: defaultcontentmodel missing from most namespaces in Wikidata namespaces siteinfo (breaks pywikibot) - https://phabricator.wikimedia.org/T288724
[08:46:42] (CR) Filippo Giunchedi: "This patch series (and the required commits to activate the functionality) are live on the pontoon-o11y stack (Cloud VPS project 'monitori" [puppet] - https://gerrit.wikimedia.org/r/712099 (owner: Filippo Giunchedi)
[08:46:47] (CR) Filippo Giunchedi: "This patch series (and the required commits to activate the functionality) are live on the pontoon-o11y stack (Cloud VPS project 'monitori" [puppet] - https://gerrit.wikimedia.org/r/712100 (owner: Filippo Giunchedi)
[08:47:03] (CR) David Caro: [V: +2 C: +2] prometheus.icinga-exporter: use caps for ceph too [puppet] - https://gerrit.wikimedia.org/r/712104 (owner: David Caro)
[08:48:18] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet
[08:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:27] (PS2) Giuseppe Lavagetto: mediawiki: Migrate wikidatawiki dispatch crons to three systemd timers [puppet] - https://gerrit.wikimedia.org/r/710520 (https://phabricator.wikimedia.org/T288175) (owner: Ladsgroup)
[08:48:32] (CR) Marostegui: [C: +1] "IPs, hostnames and racks look good" [mediawiki-config] - https://gerrit.wikimedia.org/r/712115 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[08:48:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org
[08:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:10] (PS1) Lucas Werkmeister (WMDE): Revert "Inject NamespaceInfo into EntitySourceDefinitionsConfigParser" [extensions/Wikibase] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711714 (https://phabricator.wikimedia.org/T288724)
[08:49:45] (CR) Kormat: [C: +2] ProductionServices: Add new pc hosts. [mediawiki-config] - https://gerrit.wikimedia.org/r/712115 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[08:51:14] (Merged) jenkins-bot: ProductionServices: Add new pc hosts. [mediawiki-config] - https://gerrit.wikimedia.org/r/712115 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[08:51:53] jouncebot: now
[08:51:53] No deployments scheduled for the next 1 hour(s) and 8 minute(s)
[08:53:22] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Adding new pc hosts (duration: 01m 09s)
[08:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:17] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudservices[1003-1004].wikimedia.org with reason: T288725
[08:55:19] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudservices[1003-1004].wikimedia.org with reason: T288725
[08:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:00] (CR) David Caro: [C: +2] openstack.pdns: increase the number of tcp connections [puppet] - https://gerrit.wikimedia.org/r/712114 (https://phabricator.wikimedia.org/T288725) (owner: David Caro)
[08:56:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 60%: After reimage', diff saved to https://phabricator.wikimedia.org/P17013 and previous config saved to /var/cache/conftool/dbconfig/20210812-085902-root.json
[08:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:43] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "Inject NamespaceInfo into EntitySourceDefinitionsConfigParser" [extensions/Wikibase] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711714 (https://phabricator.wikimedia.org/T288724) (owner: Lucas Werkmeister (WMDE))
[09:07:34] SRE-swift-storage, User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (LSobanski) CC @jcrespo as moving traffic to eqiad could influence media backup execution timelines.
[09:08:47] (CR) Lucas Werkmeister (WMDE): [C: +2] "For deployment: sync first data-access/, then repo/ and client/, or all of Wikibase/ – I think the only important one is that data-access/" [extensions/Wikibase] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711714 (https://phabricator.wikimedia.org/T288724) (owner: Lucas Werkmeister (WMDE))
[09:14:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 80%: After reimage', diff saved to https://phabricator.wikimedia.org/P17014 and previous config saved to /var/cache/conftool/dbconfig/20210812-091406-root.json
[09:14:07] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:52] (PS1) Elukey: kubeflow: add the AWS_DEFAULT_REGION env variable to storage-initializer [docker-images/production-images] - https://gerrit.wikimedia.org/r/712118 (https://phabricator.wikimedia.org/T272919)
[09:15:31] (PS1) Kormat: ProductionServices: Adjust formatting of parsercache entries [mediawiki-config] - https://gerrit.wikimedia.org/r/712119
[09:15:33] (PS1) Kormat: ProductionServices: Promote pc2011 to primary of pc1. [mediawiki-config] - https://gerrit.wikimedia.org/r/712120 (https://phabricator.wikimedia.org/T284825)
[09:16:03] (CR) Elukey: [V: +2 C: +2] kubeflow: add the AWS_DEFAULT_REGION env variable to storage-initializer [docker-images/production-images] - https://gerrit.wikimedia.org/r/712118 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[09:17:13] (CR) Marostegui: [C: +1] ProductionServices: Adjust formatting of parsercache entries [mediawiki-config] - https://gerrit.wikimedia.org/r/712119 (owner: Kormat)
[09:18:45] (CR) Marostegui: ProductionServices: Promote pc2011 to primary of pc1. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/712120 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[09:22:26] (CR) Kormat: ProductionServices: Promote pc2011 to primary of pc1. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/712120 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[09:23:01] (CR) Marostegui: [C: +1] ProductionServices: Promote pc2011 to primary of pc1. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/712120 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[09:23:45] (CR) Kormat: [C: +2] ProductionServices: Promote pc2011 to primary of pc1. [mediawiki-config] - https://gerrit.wikimedia.org/r/712120 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[09:23:51] (CR) Kormat: [C: +2] ProductionServices: Adjust formatting of parsercache entries [mediawiki-config] - https://gerrit.wikimedia.org/r/712119 (owner: Kormat)
[09:24:33] (Merged) jenkins-bot: ProductionServices: Adjust formatting of parsercache entries [mediawiki-config] - https://gerrit.wikimedia.org/r/712119 (owner: Kormat)
[09:24:37] (Merged) jenkins-bot: ProductionServices: Promote pc2011 to primary of pc1. [mediawiki-config] - https://gerrit.wikimedia.org/r/712120 (https://phabricator.wikimedia.org/T284825) (owner: Kormat)
[09:24:48] jouncebot: now
[09:24:48] No deployments scheduled for the next 0 hour(s) and 35 minute(s)
[09:25:02] I’d like to deploy a Wikibase wmf.18 backport in a moment; anyone else deploying right now? kormat?
[09:25:30] Lucas_WMDE: hey. i have a parsercache config change i'm just about to push
[09:25:34] it should only take a minute, if that's ok
[09:25:37] ok
[09:25:41] sure, I’ll wait for that
[09:25:49] I’m still waiting for gate-and-submit
[09:26:06] cool. running now.
[09:26:32] (Merged) jenkins-bot: Revert "Inject NamespaceInfo into EntitySourceDefinitionsConfigParser" [extensions/Wikibase] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711714 (https://phabricator.wikimedia.org/T288724) (owner: Lucas Werkmeister (WMDE))
[09:27:12] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc2011 to primary of pc1 T284825 (duration: 01m 10s)
[09:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:19] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825
[09:27:20] Lucas_WMDE: all done
[09:27:24] ok thanks
[09:28:17] backport seems to work on mwdebug2001, syncing in two steps
[09:28:47] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:53] !log reconfiguring replication tree for pc1 T284825
[09:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P17015 and previous config saved to /var/cache/conftool/dbconfig/20210812-092909-root.json
[09:29:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:48] there may be a brief spike in errors, though I think there won’t be
[09:29:51] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Wikibase/data-access/: Backport: [[gerrit:711714|Revert "Inject NamespaceInfo into EntitySourceDefinitionsConfigParser" (T288724)]] (1/2) (duration: 01m 08s)
[09:29:55] SRE, SRE Observability (FY2021/2022-Q1): Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (MoritzMuehlenhoff)
[09:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:59] T288724: defaultcontentmodel missing from most namespaces in Wikidata namespaces siteinfo (breaks pywikibot) - https://phabricator.wikimedia.org/T288724
[09:30:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 8 hosts with reason: Reconfiguring replication tree T284825
[09:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Reconfiguring replication tree T284825
[09:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Wikibase/: Backport: [[gerrit:711714|Revert "Inject NamespaceInfo into EntitySourceDefinitionsConfigParser" (T288724)]] (2/2) (duration: 01m 12s)
[09:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:33] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:43] alright, I’m done
[09:32:18] unfortunately I have to leave almost right away, but if anyone really needs to reach me because of that deployment, I put my phone number in ~lucaswerkmeister-wmde/phone on deploy1002
[09:33:23] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:34:08] (PS1) Muehlenhoff: Disable the "long running screen/tmux session" check by default [puppet] - https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028)
[09:34:48] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/707875 (owner: PipelineBot)
[09:35:38] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/707484 (owner: PipelineBot)
[09:36:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:41] (PS1) Jcrespo: bacula: Change default retention to 60 days [puppet] - https://gerrit.wikimedia.org/r/712129
[09:44:25] (PS2) Jcrespo: bacula: Change default retention to 60 days [puppet] - https://gerrit.wikimedia.org/r/712129
[09:45:49] (CR) David Caro: wmcs.ceph: add cloudcephosd1018 as osd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) (owner: David Caro)
[09:46:01] (CR) Jcrespo: [C: +2] bacula: Change default retention to 60 days [puppet] - https://gerrit.wikimedia.org/r/712129 (owner: Jcrespo)
[09:49:51] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Restarting to pick up Java security updates - hnowlan@cumin1001
[09:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T1000).
[10:04:27] (PS1) Muehlenhoff: Remove access for holger [puppet] - https://gerrit.wikimedia.org/r/712158
[10:04:45] (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/711459 (owner: PipelineBot)
[10:04:47] (PS2) Muehlenhoff: Remove access for holger [puppet] - https://gerrit.wikimedia.org/r/712158
[10:06:44] (CR) Filippo Giunchedi: [C: +1] Disable the "long running screen/tmux session" check by default [puppet] - https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: Muehlenhoff)
[10:07:50] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/711459 (owner: PipelineBot)
[10:08:29] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[10:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:14] (PS1) Filippo Giunchedi: install_server: use Bullseye for thanos-fe [puppet] - https://gerrit.wikimedia.org/r/712159
[10:12:28] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/712159 (owner: Filippo Giunchedi)
[10:13:20] (CR) Muehlenhoff: [C: +2] Remove access for holger [puppet] - https://gerrit.wikimedia.org/r/712158 (owner: Muehlenhoff)
[10:13:50] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:38] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:18:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2107 into API', diff saved to https://phabricator.wikimedia.org/P17016 and previous config saved to /var/cache/conftool/dbconfig/20210812-101840-marostegui.json
[10:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:44] (CR) Filippo Giunchedi: [C: +2] install_server: use Bullseye for thanos-fe [puppet] - https://gerrit.wikimedia.org/r/712159 (owner: Filippo Giunchedi)
[10:22:15] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Holger Knust out of all services on: 1743 hosts
[10:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:24] (CR) Btullis: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/711490 (owner: Muehlenhoff)
[10:22:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Holger Knust out of all services on: 1743 hosts
[10:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:59] (PS1) Ema: pontoon: cache routing setup for pontoon-traffic [puppet] - https://gerrit.wikimedia.org/r/712166
[10:29:07] (CR) Jcrespo: "There is a lot of monitor_screens:false hiera keys that were added when this was enabled. What is the intended strategy about that? Not a " [puppet] - https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: Muehlenhoff)
[10:29:17] (CR) Filippo Giunchedi: [C: +1] "LGTM, please consider splitting into "thematic" files, like ats / varnish or a generic "caching" to get an idea of what each settings file" [puppet] - https://gerrit.wikimedia.org/r/712166 (owner: Ema)
[10:33:17] (CR) Muehlenhoff: Disable the "long running screen/tmux session" check by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: Muehlenhoff)
[10:34:12] (CR) Jcrespo: [C: +1] Disable the "long running screen/tmux session" check by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: Muehlenhoff)
[10:42:16] (PS2) Btullis: Begin decommission of druid1003.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148)
[10:42:18] (PS1) Vgutierrez: wmflib: Adopt Cfssl::Wildcard type [puppet] - https://gerrit.wikimedia.org/r/712193
[10:56:30] (CR) Klausman: [C: +1] "post factum: I approve of this change and I am saddened by its necessity." [docker-images/production-images] - https://gerrit.wikimedia.org/r/711579 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[10:56:48] (CR) Klausman: [C: +1] kubeflow: add the AWS_DEFAULT_REGION env variable to storage-initializer [docker-images/production-images] - https://gerrit.wikimedia.org/r/712118 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[11:00:05] Amir1, Lucas_WMDE, and apergos: (Dis)respected human, time to deploy EU Backport and Config training. Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T1100). Please do the needful.
[11:00:44] here, no one has signed up for trainings today and there are no patches scheduled on the calendar
[11:05:16] (PS1) Muehlenhoff: logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210
[11:08:14] (CR) jerkins-bot: [V: -1] logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210 (owner: Muehlenhoff)
[11:14:04] (CR) jerkins-bot: [V: -1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/712227 (owner: L10n-bot)
[11:24:15] (PS2) Muehlenhoff: logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210
[11:26:57] (CR) jerkins-bot: [V: -1] logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210 (owner: Muehlenhoff)
[11:42:17] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01899 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[11:45:03] (PS3) Muehlenhoff: logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210
[11:47:40] (CR) jerkins-bot: [V: -1] logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210 (owner: Muehlenhoff)
[11:47:44] !log installing bluez security updates on buster
[11:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:49] (PS4) Muehlenhoff: logout cookbook: Quote CN and UID [cookbooks] - https://gerrit.wikimedia.org/r/712210
[11:56:58] (PS2) Ema: pontoon: cache routing setup for pontoon-traffic [puppet] - https://gerrit.wikimedia.org/r/712166
[11:56:59] !log installing openexr security updates
[11:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:35] (PS3) Filippo Giunchedi: pontoon: add config command [puppet] - https://gerrit.wikimedia.org/r/711543
[12:04:06] !log upgrade NIC firmware on thanos-fe100[12] - T286722
[12:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:14] T286722: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722
[12:05:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-rule site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:08:33] known ^
[12:08:42] !log upgrade NIC firmware on thanos-fe100[34] - T286722
[12:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:26] !log upgrade NIC firmware on thanos-be1* - T286722
[12:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:33] T286722: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722
[12:10:02] (PS3) Btullis: Begin decommission of druid1003.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148)
[12:12:45] (PS3) Muehlenhoff: Apply MX role to mx2002 [puppet] - https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911)
[12:15:07] (CR) Muehlenhoff: [C: +2] Apply MX role to mx2002 [puppet] - https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff)
[12:18:54] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: REIMAGE
[12:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:05] (PS4) Filippo Giunchedi: pontoon: add config command [puppet] - https://gerrit.wikimedia.org/r/711543
[12:22:23] (PS1) Muehlenhoff: acmechief: acmechief: allow mx2002 [puppet] - https://gerrit.wikimedia.org/r/712277 (https://phabricator.wikimedia.org/T286911)
[12:23:07] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: REIMAGE
[12:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:23] (CR) Elukey: [C: +1] "Didn't find traces of druid1003 in refinery and puppet, superset's config also mentions an-druid1001, and I can't think of about other pla" [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148) (owner: Btullis)
[12:27:36] (CR) Ema: [C: +2] pontoon: cache routing setup for pontoon-traffic [puppet] - https://gerrit.wikimedia.org/r/712166 (owner: Ema)
[12:28:29] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Restarting to pick up Java security updates - hnowlan@cumin1001
[12:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:56] (PS1) Muehlenhoff: mtail: On bullseye use the distro default (3.0.0-rc43) [puppet] - https://gerrit.wikimedia.org/r/712287 (https://phabricator.wikimedia.org/T275873)
[12:37:23] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/712287 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[12:37:25] (PS2) Vgutierrez: envoyproxy: Remove trailing whitespace [puppet] - https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880)
[12:37:27] (PS2) Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880)
[12:37:29] (PS3) Vgutierrez: envoyproxy: Add prefetched OCSP staple support [puppet] - https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421)
[12:37:31] (PS5) Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421)
[12:37:33] (PS3) Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421)
[12:37:35] (PS2) Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421)
[12:37:37] (PS2) Vgutierrez: envoyproxy: Add upstream PROXY protocol support [puppet] - https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421)
[12:37:39] (PS2) Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421)
[12:37:41] (PS2) Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421)
[12:39:32] (CR) jerkins-bot: [V: -1] envoyproxy: Add dual stack cert support [puppet] - https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[12:39:47] uh, that's unexpected
[12:40:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:41:43] 14:39:26 KeyError: key not found: "PARALLEL_PID_FILE"
[12:41:51] hmmm CI glitch?
[12:42:55] (CR) Vgutierrez: "recheck" [puppet] - https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[12:43:48] !log upgrade NIC firmware on thanos-be2* / thanos-fe2* - T286722
[12:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:58] T286722: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722
[12:52:37] (PS4) Btullis: Begin decommission of druid1003.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148)
[12:52:47] (CR) Btullis: Begin decommission of druid1003.eqiad.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148) (owner: Btullis)
[12:55:31] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/712287 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[12:56:14] (CR) Btullis: [C: +2] Begin decommission of druid1003.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148) (owner: Btullis)
[12:57:07] (Abandoned) Effie Mouzeli: mwdebug: include nutcracker and mcrouter pools in values [deployment-charts] - https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: Effie Mouzeli)
[13:03:40] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Restarting to pick up Java security updates - hnowlan@cumin1001
[13:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:32] (CR) Effie Mouzeli: [C: +1] Add task manager data port configuration for flink session cluster [deployment-charts] - https://gerrit.wikimedia.org/r/711152 (https://phabricator.wikimedia.org/T288531) (owner: ZPapierski)
[13:10:35] (PS2) Effie Mouzeli: admin: add comment about tillerClusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/705717
[13:10:46] (CR) jerkins-bot: [V: -1] admin: add comment about tillerClusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/705717 (owner: Effie Mouzeli)
[13:13:26] (Abandoned) Effie Mouzeli: admin: add comment about tillerClusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/705717 (owner: Effie Mouzeli)
[13:13:33] (PS1) Effie Mouzeli: admin: add comment about tillerClusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/712307
[13:13:48] (PS10) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[13:17:12] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:20:57] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts druid1003.eqiad.wmnet
[13:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:29] (CR) Giuseppe Lavagetto: [C: +1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[13:24:08] (PS1) Btullis: Decommission druid1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148)
[13:31:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1003.eqiad.wmnet
[13:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:11] Woo-hoo. Just decommissioned my first server. Hope it was the right one :-)
[13:35:00] (PS1) Jelto: gitlab::backup move backup cronjobs to puppet [puppet] - https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463)
[13:39:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Restarting to pick up Java security updates - hnowlan@cumin1001
[13:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:43] !log disable puppet on mediawiki hosts to merge 705852
[13:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:33] (PS2) Hnowlan: cassandra: remove cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/710985 (https://phabricator.wikimedia.org/T186567)
[13:57:35] !log disabling puppet on P:cassandra to test removal of cassandra-metrics-agent
[13:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:22] (PS1) Jgiannelos: scripts: Allow overriding cache operation [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/712337
[14:07:35] (PS1) Elukey: kubeflow: update storage-init's image and add variable for local gw [deployment-charts] - https://gerrit.wikimedia.org/r/712346 (https://phabricator.wikimedia.org/T272919)
[14:07:45] (CR) Hnowlan: [C: +2] cassandra: remove cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/710985 (https://phabricator.wikimedia.org/T186567) (owner: Hnowlan)
[14:10:55] (CR) Elukey: Decommission druid1002.eqiad.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148) (owner: Btullis)
[14:12:05] (PS2) David Caro: wmsc.puppet_alert: force utf-8 encoding when opening files [puppet] - https://gerrit.wikimedia.org/r/711106 (https://phabricator.wikimedia.org/T288508)
[14:12:07] (CR) David Caro: wmsc.puppet_alert: force utf-8 encoding when opening files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/711106 (https://phabricator.wikimedia.org/T288508) (owner: David Caro)
[14:13:49] (PS2) Elukey: kubeflow: update storage-init's image and add variable for local gw [deployment-charts] - https://gerrit.wikimedia.org/r/712346 (https://phabricator.wikimedia.org/T272919)
[14:16:31] (PS2) Btullis: Decommission druid1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148)
[14:16:54] (CR) Jgiannelos: [C: +2] tegola: Add cronjob for tiles pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/701938 (owner: Jgiannelos)
[14:17:42] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:19:25] (Merged) jenkins-bot: tegola: Add cronjob for tiles pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/701938 (owner: Jgiannelos)
[14:24:46] ops-eqiad, DC-Ops, decommission-hardware: decommission druid1003.eqiad.wmnet - https://phabricator.wikimedia.org/T288736 (RhinosF1)
[14:25:42] !log reenabling puppet on P:cassandra
[14:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:34] (CR) Elukey: "LGTM, but I see druid daemons running on 1002, is the node still active?" [puppet] - https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148) (owner: Btullis)
[14:30:27] (CR) MSantos: [C: +2] scripts: Allow overriding cache operation [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/712337 (owner: Jgiannelos)
[14:31:27] (Merged) jenkins-bot: scripts: Allow overriding cache operation [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/712337 (owner: Jgiannelos)
[14:33:28] !log reset to factory ps2-test-d8-codfw
[14:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:33] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: REIMAGE
[14:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:50] (CR) Jgiannelos: postgresql::user: split HBA configuration into a different define (1 comment) [puppet] - https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: Hnowlan)
[14:35:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:35:54] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: REIMAGE
[14:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:10] (PS4) Filippo Giunchedi: pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098
[14:36:12] (PS4) Filippo Giunchedi: pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099
[14:36:14] (PS4) Filippo Giunchedi: pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100
[14:36:16] (PS1) Filippo Giunchedi: role: add standard/firewall to pontoon::frontend [puppet] - 
10https://gerrit.wikimedia.org/r/712364 [14:36:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:36:52] PROBLEM - Host ps2-test-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:38:27] (03PS1) 10Cathal Mooney: Add cloudsw2-d5-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/712365 (https://phabricator.wikimedia.org/T277340) [14:39:06] (03CR) 10Filippo Giunchedi: [C: 03+2] role: add standard/firewall to pontoon::frontend [puppet] - 10https://gerrit.wikimedia.org/r/712364 (owner: 10Filippo Giunchedi) [14:42:57] (03CR) 10Btullis: Decommission druid1002.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [14:43:30] (03CR) 10Btullis: Decommission druid1002.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [14:44:30] PROBLEM - Host thanos-fe1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:10] RECOVERY - Host thanos-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:47:16] reimaged ^ [14:48:26] PROBLEM - Druid broker on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:48:51] !log reset to factory ps-test-d8-codfw [14:48:56] PROBLEM - Druid middlemanager on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] PROBLEM - Druid overlord on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, 
args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:49:02] PROBLEM - Druid historical on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:49:06] PROBLEM - Druid coordinator on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:49:18] btullis: ^ [14:49:20] PROBLEM - Check systemd state on druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: druid-broker.service,druid-coordinator.service,druid-historical.service,druid-middlemanager.service,druid-overlord.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:25] Think you might need to downtime again [14:49:42] RhinosF1: Thanks. [14:49:49] Np [14:55:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={thanos-query,thanos-query-frontend,thanos-store} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:21] (03PS1) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [14:56:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:57:59] (03CR) 10Btullis: [C: 03+2] Decommission druid1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/712315 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [14:58:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={thanos-query,thanos-query-frontend,thanos-store} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:58:53] that's me ^ [14:59:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:00:01] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts druid1002.eqiad.wmnet [15:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:03:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={thanos-query-frontend,thanos-store} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:04:50] (03PS2) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [15:04:51] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: REIMAGE [15:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:01] RECOVERY - Prometheus jobs reduced 
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:06:24] (03CR) 10Elukey: [C: 03+2] kubeflow: update storage-init's image and add variable for local gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/712346 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:07:16] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: REIMAGE [15:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:38] (03CR) 10Vgutierrez: "pcc shows basically a NOOP for existing nodes using envoy: https://puppet-compiler.wmflabs.org/compiler1003/30557/" [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:10:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1002.eqiad.wmnet [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:57] PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:01] PROBLEM - Host thanos-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:16] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/712287 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:16:35] RECOVERY - Host thanos-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [15:16:47] PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:59] RECOVERY - Check systemd state on thanos-fe1003 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:50] (03PS3) 10SBassett: Add growthexperiments to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: 10Urbanecm) [15:24:04] (03PS1) 10Jgiannelos: scripts: Fix reading default value for env var [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/712375 (https://phabricator.wikimedia.org/T270175) [15:26:19] (03CR) 10Dave Pifke: "PCC output: https://puppet-compiler.wmflabs.org/compiler1001/30558/" [puppet] - 10https://gerrit.wikimedia.org/r/711166 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [15:27:40] (03CR) 10Muehlenhoff: [C: 03+2] mtail: On bullseye use the distro default (3.0.0-rc43) [puppet] - 10https://gerrit.wikimedia.org/r/712287 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:28:22] (03CR) 10SBassett: [C: 03+1] Add growthexperiments to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: 10Urbanecm) [15:28:33] (03PS1) 10Jforrester: TranslationPage: Use Title::getPrefixedDBkey when extracting messages [extensions/Translate] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711718 (https://phabricator.wikimedia.org/T288683) [15:31:22] RECOVERY - Host ps1-d8-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.51 ms [15:32:42] (03CR) 10Abijeet Patro: [C: 03+1] "Unfortunately, I do not have permission to give CR +2 to this." 
[extensions/Translate] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711718 (https://phabricator.wikimedia.org/T288683) (owner: 10Jforrester) [15:33:19] !log importing openjdk-8 8u302-b08-1+deb11u1 to apt.wikimedia.org/component/jdk8 T287960 [15:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:27] T287960: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 [15:33:37] (03CR) 10MSantos: [C: 03+2] scripts: Fix reading default value for env var [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/712375 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [15:34:35] (03Merged) 10jenkins-bot: scripts: Fix reading default value for env var [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/712375 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [15:35:25] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh4002.wikimedia.org [15:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:07] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) ` sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 10 --network public ulsfo doh4002 Ready to create Ganeti VM doh4002.wikimedia.org in the ganeti01.svc.ulsfo.wmn... 
[15:36:12] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh4002.wikimedia.org [15:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:50] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) ` 2021-08-12 15:37:50,980 [ERROR] Failed to run Traceback (most recent call last): File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main... [15:38:59] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10MoritzMuehlenhoff) Please don't create new instances with 10G "disks", these tend to cause more work in the long term, e.g. by filing up the root partition with kernels etc. 15G or... [15:40:26] (03PS3) 10Cwhite: profile: improve kafka_shipper rsyslog output ssl options [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) [15:41:06] mutante: are you working on +doh4002 1H IN A 198.35.26.6 [15:41:10] +doh4002 1H IN AAAA 2620:0:863:1:198:35:26:6 [15:41:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) [15:41:36] (03CR) 10Cwhite: profile: improve kafka_shipper rsyslog output ssl options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:41:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) 05Open→03Stalled [15:42:32] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/712389 [15:43:01] papaul: not anymore, the cookbook failed for some reason [15:43:08] i have pending DNS commit [15:43:24] sigh, not sure how to clean it up then [15:44:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:21] let me see in netbox [15:44:24] mutante: i added it you can go back an just remove it [15:44:29] ah, it exited succesfully? [15:44:45] mutante: yes [15:45:01] ok papaul, thanks. looks like a software bug [15:45:12] mutante: you welcome [15:46:10] thanks both! [15:46:13] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/705852 (owner: 10Effie Mouzeli) [15:47:17] !log netbox - deleted 198.35.26.6 (doh4002) [15:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:39] sukhe: np, but I have not seen that error I pasted before [15:47:51] it's not the same that happens when out of IPs, something else [15:49:29] !log netbox - deleted 2620:0:863:1:198:35:26:6/64 (along with 198.35.26.6) due to the previous error when running makevm cookbook (T288630) [15:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:38] T288630: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 [15:49:46] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) [15:49:56] 10SRE, 10Analytics, 10Infrastructure-Foundations: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10MoritzMuehlenhoff) 05Open→03Resolved OpenJDK 8u302 has been rebuilt against the bootstrap packages (which were removed) and eventually imported. Resolving this, please... 
[15:50:20] !log powerdown ms-be2060 for relocation [15:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:48] PROBLEM - Host ms-be2060 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:55] (PS1) Dzahn: acme_chief: simplify regex for doh and add doh4002 [puppet] - https://gerrit.wikimedia.org/r/712393 [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T1600). [16:00:05] dpifke: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:18] Here. [16:01:51] dpifke: with apologies -- can I get back to you in 30m? I have a meeting conflict and John is out of office [16:01:52] (CR) Ssingh: [C: +1] acme_chief: simplify regex for doh and add doh4002 [puppet] - https://gerrit.wikimedia.org/r/712393 (owner: Dzahn) [16:02:26] rzl: No problem, see you in a bit.
[16:06:40] (PS1) Ssingh: Add doh4002 to BGP anycast in ulsfo [homer/public] - https://gerrit.wikimedia.org/r/712400 (https://phabricator.wikimedia.org/T283503) [16:07:24] RECOVERY - Host ms-be2060 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms [16:08:14] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/712405 [16:08:18] (PS1) Ssingh: site: switch doh4002 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/712406 [16:10:41] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: Deploy tilerator 1.1.7-beta.5 [16:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:28] (CR) Dzahn: [C: +2] acme_chief: simplify regex for doh and add doh4002 [puppet] - https://gerrit.wikimedia.org/r/712393 (owner: Dzahn) [16:13:12] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: Deploy tilerator 1.1.7-beta.5 (duration: 02m 30s) [16:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:30] (CR) Dzahn: [C: +2] arclamp: add temporary excimer-k8s pipeline [puppet] - https://gerrit.wikimedia.org/r/711166 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke) [16:13:48] dpifke: rzl: done [16:14:02] Thanks! :) [16:14:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:14:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:18] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps2010: [16:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:42] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps2010: (duration: 00m 23s) [16:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:14:54] PROBLEM - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:14:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:32] PROBLEM - tileratorui on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:15:37] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps2009: [16:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:58] dpifke: ran puppet on webperf100[12], change applied on 1002 [16:16:00] (03CR) 10Dave Pifke: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [16:16:01] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps2009: (duration: 00m 24s) [16:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:00] maps2004 is depooled, hnowlan should we stop tilerator to avoid the error messages? 
[16:17:13] (03CR) 10Krinkle: "I don't think we can use this here. We run before any multiversion, MW or wmf-config code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [16:17:48] mutante: Yup, looks good. [16:18:03] mbsantos: I can add a downtime but how many hours would make the most sense? [16:18:06] dpifke: cool [16:18:19] (03CR) 10Ayounsi: [C: 03+1] Add cloudsw2-d5-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/712365 (https://phabricator.wikimedia.org/T277340) (owner: 10Cathal Mooney) [16:19:22] mutante: this machine (old clsuter maps[1-2]00[1-4]) is going to be decomm in the next couple of weeks [16:20:41] mbsantos: ACK, a couple days would be easy, a couple weeks sounds a bit too long though I guess [16:21:23] let me just ACK it then [16:21:47] mutante: ack, we should stop the service through puppet anyway so it might be unnecessary to do anything other than that [16:21:59] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps2008: [16:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:14] ACKNOWLEDGEMENT - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused daniel_zahn depooled https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:22:14] ACKNOWLEDGEMENT - tileratorui on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6535: Connection refused daniel_zahn depooled https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:22:23] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps2008: (duration: 00m 24s) [16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:20] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps2007: [16:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:47] !log 
mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps2007: (duration: 00m 27s) [16:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:13] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps2006: [16:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:36] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps2006: (duration: 00m 23s) [16:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:49] mbsantos: yea, not sure if worth it but stopping the service would not remove the alert [16:25:16] the ACK means it won't alert until next time it starts/stops again, so probably good enough [16:25:39] unless it's flapping it should be silent [16:26:00] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps2005: [16:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:10] mutante: thanks for the explanation [16:26:14] (03PS1) 10Hnowlan: scap: make maps2009 the default maps canary [puppet] - 10https://gerrit.wikimedia.org/r/712420 [16:26:17] I'll keep an eye on it [16:26:24] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps2005: (duration: 00m 24s) [16:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:49] mbsantos: thanks:) [16:27:34] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps1010: [16:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps1010: (duration: 00m 15s) [16:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:27] (03CR) 10Dzahn: "This will only work if the first puppet run installs everything perfectly without needing a second run. 
Otherwise the cookbook will fail o" [puppet] - 10https://gerrit.wikimedia.org/r/712406 (owner: 10Ssingh) [16:28:44] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps1009: [16:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:00] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps1009: (duration: 00m 17s) [16:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:31] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps1008: [16:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:47] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps1008: (duration: 00m 15s) [16:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:47] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps1007: [16:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:03] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps1007: (duration: 00m 15s) [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:33] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps1006: [16:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:49] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps1006: (duration: 00m 15s) [16:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:30] !log enabling puppet on mediawiki servers && rolling restart mcrouter [16:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:43] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@b88cf50]: maps1005: [16:32:48] mutante: oh great, thank you! 
[16:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:58] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@b88cf50]: maps1005: (duration: 00m 15s) [16:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:06] rzl: np:) [16:33:10] (03PS2) 10Ssingh: site: add role insetup for doh4002 [puppet] - 10https://gerrit.wikimedia.org/r/712406 [16:33:24] (03CR) 10Ssingh: site: add role insetup for doh4002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712406 (owner: 10Ssingh) [16:33:39] (03CR) 10Filippo Giunchedi: "Nice, thank you! re-reading the patch I don't think we should make SSL for kafka optional though, rather make the CA required like it is n" [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [16:34:04] (03CR) 10Ayounsi: [C: 03+1] Add doh4002 to BGP anycast in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/712400 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [16:34:23] (03CR) 10Dzahn: [C: 03+1] "this is good but we were not able to create the VM yet, it also won't hurt to merge it before that though" [puppet] - 10https://gerrit.wikimedia.org/r/712406 (owner: 10Ssingh) [16:37:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh4002.wikimedia.org [16:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:13] (03PS1) 10Bearloga: statistics::discovery: Stop metric calculation [puppet] - 10https://gerrit.wikimedia.org/r/712422 (https://phabricator.wikimedia.org/T227782) [16:40:44] (03CR) 10Dzahn: [C: 03+2] site: add role insetup for doh4002 [puppet] - 10https://gerrit.wikimedia.org/r/712406 (owner: 10Ssingh) [16:40:51] (03CR) 10Abijeet Patro: [C: 03+1] "Nevermind I realize this will be +2'ed when deploying." 
[extensions/Translate] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711718 (https://phabricator.wikimedia.org/T288683) (owner: 10Jforrester) [16:43:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) >>! In T288630#7279170, @MoritzMuehlenhoff wrote: > Please don't create new instances with 10G "disks", these tend to cause mo... [16:43:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) [16:47:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4002.wikimedia.org [16:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:42] (03CR) 10Cathal Mooney: [C: 03+2] Add cloudsw2-d5-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/712365 (https://phabricator.wikimedia.org/T277340) (owner: 10Cathal Mooney) [16:49:17] (03PS2) 10Dave Pifke: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) [16:49:20] (03Merged) 10jenkins-bot: Add cloudsw2-d5-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/712365 (https://phabricator.wikimedia.org/T277340) (owner: 10Cathal Mooney) [16:49:47] (03PS1) 10Dzahn: DHCP: add MAC for doh4002 [puppet] - 10https://gerrit.wikimedia.org/r/712429 (https://phabricator.wikimedia.org/T288630) [16:51:02] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC for doh4002 [puppet] - 10https://gerrit.wikimedia.org/r/712429 (https://phabricator.wikimedia.org/T288630) (owner: 10Dzahn) [16:57:38] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) Both test PDU's removed and old PDU"s back in place. 
[17:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T1700). [17:02:34] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:04:00] (03CR) 10Jcrespo: "I see nothing blocker - like your previous patch, this seems it will require followup patches, but didn't see any major blocker. Look at t" [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123) [17:05:49] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) This doesn't work: `curl -v -H 'X-Wikimedia-Debug: backend=k8s-experimental' https://en.wikipedia.org/favi... [17:13:01] (03CR) 10Dzahn: "😊" [puppet] - 10https://gerrit.wikimedia.org/r/705852 (owner: 10Effie Mouzeli) [17:13:34] sukhe: ssh doh4002.wikimedia.org :) all yours [17:14:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T288630 (10Dzahn) 05Stalled→03Resolved a:03Dzahn - VM created - added to DHCP - installed OS - ran puppet (insetup) - verified SSH access... [17:18:03] mutante: thanks very much! [17:18:04] <3 [17:18:50] yep:) happy to [17:18:59] (03PS1) 10Ayounsi: Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/712467 [17:33:29] (03CR) 10Bstorm: "I'll try to merge and deploy this today, but no guarantees. 
The deployment is quite time consuming because it often requires depooling ser" [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: 10Urbanecm) [17:41:50] (03CR) 10Bstorm: [C: 03+2] labstore: remove absented archive_export_d cron [puppet] - 10https://gerrit.wikimedia.org/r/711623 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:55:58] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Krinkle) `/favicon.ico` is (or should be) written to `w/favicon.php`. [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:19:08] (03PS2) 10Herron: acmechief: acmechief: allow mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/712277 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [18:26:58] (03CR) 10Herron: [C: 03+1] Disable the "long running screen/tmux session" check by default [puppet] - 10https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: 10Muehlenhoff) [18:29:31] Deployers if you're here there is https://phabricator.wikimedia.org/T281159#7279819 but it's not been scheduled and needs tests [18:33:26] well there are always some deployers. I can help if someone can join here to help to test it :) [18:35:24] pinged abi, let's see [18:40:24] it's around midnight in his timezone [18:40:40] yeah, realized too late :/ [18:40:53] Nikerabbit: but maybe you can also help to test it? 
:D [18:42:00] urbanecm: I'm on vacation ^^ [18:42:15] :( [18:43:48] I think being on vacation merits a :) [18:44:44] so the task seems to have enough info to test [18:45:44] twentyafterfour: well you can definitely deploy it if you want -- it's technically unreviewed code though (and without the team around), so...not sure it's a good idea :D [18:49:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:48] urbanecm: how is it unreviewed? It got +1 from Kartik and Thiemo as well as a +2 from the author with good justifications for merging [18:55:09] (03PS1) 10Gergő Tisza: Add Link: fix invalidation on non-addlink edit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711719 (https://phabricator.wikimedia.org/T283606) [18:56:15] as a train blocker we either deploy that, revert the offending patch or delay the train. We should only consider delaying the train until next week in extreme situations. [19:00:00] I don't mind someone else doing it, but I don't feel comfortable doing it myself [19:00:05] jeena and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T1900). [19:00:48] I'll just note that I've got like 3-4 people messaging me directly about the issue, so it's highly visible. The patch looks simple and good to me. [19:06:01] Nikerabbit: thanks, I've looked deeper at the code and indeed it looks correct. I'm comfortable deploying it [19:06:29] Nikerabbit: maybe you should log out of IRC and enjoy your vacation? ;) [19:06:47] (03CR) 10Herron: "I agree with requiring SSL. 
Some additional comments relating to that inline." [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [19:07:21] twentyafterfour: would be good, but I don't have separate clients for personal and work use [19:08:50] (03CR) 1020after4: [C: 03+2] "After reviewing the code in question, this does indeed appear to be a correct fix. Given that it's been endorsed by a bunch of people, tes" [extensions/Translate] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711718 (https://phabricator.wikimedia.org/T288683) (owner: 10Jforrester) [19:26:39] (03Merged) 10jenkins-bot: TranslationPage: Use Title::getPrefixedDBkey when extracting messages [extensions/Translate] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711718 (https://phabricator.wikimedia.org/T288683) (owner: 10Jforrester) [19:27:18] Hi all, I'm around to test: 711718: TranslationPage: Use Title::getPrefixedDBkey when extracting messages | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/711718 related to UBN! = https://phabricator.wikimedia.org/T288683 and https://phabricator.wikimedia.org/T288700 [19:29:13] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) I think the issue is that the php-fpm container processing the request is trying to make an outgoing HTTP r... [19:30:34] abijeet: wow thanks! [19:30:39] it's about ready to test ... [19:30:45] (running through ci now) [19:30:55] twentyafterfour, cool [19:31:08] oh it merged. ok I'll deploy [19:32:06] abijeet: should we test with mwdebug first or should I just push this out to prod since it's currently broken? [19:32:43] twentyafterfour, we can roll it out, and I can do a sanity check after that. [19:33:02] twentyafterfour, I'm fairly confident about the patch, since I did deploy it on translatewiki as well. 
[19:34:04] tacsipacsi, I see you are around, would you be able to mark a page for translation to see that all the translation pages are fixed? [19:34:18] (I unfortunately do not have permission to do that) [19:34:30] Yes, this is why I’m here ;) [19:34:46] thank you! :) [19:34:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:49] abijeet: also I'm more than happy to do more or less any privileged action that might be needed for testing it [19:36:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:50] urbanecm, thank you, I think we are fine this time :) [19:38:09] ok. Offering just in case :) [19:39:16] abijeet: yeah seems safe [19:39:19] I'm syncing it [19:43:12] !log twentyafterfour@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Translate/src/PageTranslation/TranslationPage.php: sync I2f46abb20145630c27449ce57f1256e92f440144 which should fix T288683 & T288700 thus unblocking the train: T281159 (duration: 01m 07s) [19:43:13] ok it should be ready to test [19:43:21] thanks, checking [19:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:25] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683 [19:43:25] T288700: Translation page is no more updated while creating or editing translation units - https://phabricator.wikimedia.org/T288700 [19:43:25] T281159: 1.37.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T281159 [19:45:27] Looks good on my test. Unit translations are working fine, and are triggering a translation page update. [19:46:07] thanks abijeet! 
much appreciated [19:46:11] I marked Meta:IP block exemption for translation, and the German page is in German again: https://meta.wikimedia.org/w/index.php?title=Meta:IP_block_exemption/de&diff=21875063 [19:46:20] \o/ [19:46:25] thanks everyone! [19:46:44] * urbanecm is wondering if we should bother somehow fixing _all_ pages affected [19:46:56] tacsipacsi, I'm going to be around for 20 minutes or so more. [19:47:18] urbanecm, yes, let me check the command for that. We have a script in the Translate extension to do that. [19:47:50] so shall I mark T288683 as resolved? [19:47:56] oh, great. I was thinking about marking them for translation via API originally. Good there's a script [19:48:40] I’d keep the task open until the script is run, but you can remove it as a train blocker. [19:50:55] (03PS1) 10Aaron Schulz: Avoid udp2log for "objectcache" channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712548 (https://phabricator.wikimedia.org/T288702) [19:52:18] urbanecm, there is a script under Translate extension: Translate/scripts/refresh-translatable-pages.php --jobqueue; If we run that, it should update all translatable pages. It will spawn a lot of jobs though. [19:52:40] that should be fine, as it's a one-off thing [19:52:47] abijeet: do you want to run it, or should I? [19:54:10] I'm not sure how to do that, and I think it's rather late in the day for me to try out new stuff :D [19:54:10] In the sense, I don't know how to do that on the Wikimedia servers. [19:54:44] sorry, back [19:54:48] no problem [19:55:22] abijeet: I'm thinking, is it a good idea to run it w/o --jobqueue? [19:55:34] it will process everything from the maintenance server, but it'd be easy to kill it should that be needed [19:56:18] ok running time would be large, and there could be memory leaks.
[19:56:54] the script doesn't appear to be checking if a page is outdated, it just updates everything, so even if a page is processed, it'll process it again if you run the script from scratch. [19:57:53] i guess that means "no" [19:58:08] “doesn't appear to be checking if a page is outdated” – which is actually a good thing here, since these pages are completely up-to-date, just broken. [19:58:30] urbanecm, let's run it via the job queue. [19:58:34] okay okay [19:59:32] I see this old comment from Niklas: https://phabricator.wikimedia.org/T195347#4277953: "This has now been done for all wikis. MetaWiki took about one hour with the updated script, which is really fast." [19:59:33] :D [20:00:14] The updated script here refers to refresh-translatable-pages.php with the job queue option. [20:00:21] sure, let's do it [20:00:34] * urbanecm is trying to find a small-ish wiki where the bug is present [20:04:04] wikidata is one candidate, not a very large number of pages marked for translation: https://www.wikidata.org/wiki/Special:PageTranslation [20:05:07] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:23] urbanecm, https://wikimania.wikimedia.org/wiki/Special:PageTranslation this is also an option? [20:09:16] Another one: https://beta.wikiversity.org/wiki/Special:PageTranslation [20:11:03] As far as I see, betawikiversity has no broken pages. Wikimania does have, e.g.
https://wikimania.wikimedia.org/wiki/Template:Wikimania_2021_header/fr [20:12:18] thanks both (and sorry for late response, got distracted) [20:12:23] running for wikimania [20:13:30] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=wikimaniawiki --jobqueue # T288683 [20:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:39] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683 [20:14:15] it says "Queued refresh for 70 translatable pages." [20:15:21] yup, 70 looks correct [20:17:30] fuzzybot edited https://wikimania.wikimedia.org/wiki/Template:Wikimania_2021_header/fr, but that doesn't look like french -- is that an issue? [20:18:24] “This page was last edited on 12 August 2021, at 17:19.” It was before the deployment. [20:19:21] right [20:19:29] this logstash query should help: https://logstash.wikimedia.org/app/discover#/?_g=(filters:!(),query:(language:lucene,query:'*'),refreshInterval:(pause:!t,value:0),time:(from:now-5m,to:now))&_a=h@ea60794 (I think) [20:20:22] abijeet: unfortunately, it says "Unable to completely restore the URL, be sure to use the share functionality." 
[20:20:44] ah ok, please try this: https://logstash.wikimedia.org/app/discover#/?_g=(filters:!(),query:(language:lucene,query:'*'),refreshInterval:(pause:!t,value:0),time:(from:now-5m,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:wiki,negate:!f,params:(query:wikimaniawiki),type:phrase),query:(match_phrase:(wiki:wikimaniawiki))),('$state':(store:appState),meta:(alias:!n,disabled:!f,i [20:20:44] ndex:'logstash-*',key:channel,negate:!f,params:(query:Translate.Jobs),type:phrase),query:(match_phrase:(channel:Translate.Jobs)))),index:'logstash-*',interval:auto,query:(language:kuery,query:''),sort:!()) [20:20:48] wowk [20:20:59] short link: https://logstash.wikimedia.org/goto/686b459271569b2b57cbc2f159018a84 [20:21:15] that works, thanks! [20:23:22] This particular page is now fixed: https://wikimania.wikimedia.org/w/index.php?title=Template:Wikimania_2021_header/fr&diff=130842 [20:23:29] it's fixed [20:23:45] All jobs are also done. [20:24:12] great [20:24:18] running wikidatawiki now [20:24:31] and then i'll run it everywhere (minus those two) [20:24:57] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=wikimaniawiki --jobqueue # T288683 [20:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:05] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683 [20:26:02] i expect mediawiki and metawiki to spawn a lot of jobs. I would recommend that we run these individually if possible. [20:26:33] sure, will do abijeet [20:27:04] should i run them right after wikidatawiki finishes, or at the very end? [20:27:39] urbanecm, Will you have to run the refresh-translatable-page command individually for all the wikis? [20:27:55] i'll do a shell for loop [20:28:13] ok, cool. 
[20:28:37] so it will run sequentially, not in parallel (if that's why you asked me to run them individually) [20:29:47] I'm not sure about the jobqueue in place on the server, is it possible for it to have too many jobs? the script can queue jobs faster than the job queue can run them. [20:32:52] abijeet: i can definitely wait until logstash stops saying something for the two heavy wikis, if that would help [20:33:52] urbanecm, I would also add commons to the "heavy" wiki list. [20:33:58] noted [20:34:15] I'm also looking on https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&from=now-12h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-job=TranslateRenderJob for jobqueue [20:35:00] cool [20:36:13] afk for a bit, will check in again after 20 minutes or so. [20:36:34] tacsipacsi, thanks for your help here and on phab. [20:36:46] tacsipacsi: thanks, too [20:37:03] ack [20:37:41] Happy to help. I was really annoyed by this bug. ;) [20:42:17] (03PS1) 1020after4: EventDispatcher: Remove failing invariant check [extensions/DiscussionTools] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711720 (https://phabricator.wikimedia.org/T288775) [20:43:03] twentyafterfour: ftr feel free to deploy if you want to [20:43:37] thanks urbanecm, I will shortly [20:43:43] great [20:43:58] * urbanecm is continuing to run the maint script to fix the Translate bug everywhere [20:44:00] (03CR) 1020after4: [C: 03+2] "unblocking the train. 
T281159" [extensions/DiscussionTools] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711720 (https://phabricator.wikimedia.org/T288775) (owner: 1020after4) [20:47:55] PROBLEM - mcrouter process on mwmaint2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [20:48:25] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:45] ^ looking [20:49:19] rzl: i did run a maint script at mwmaint, but it should "just" submit a handful of jobs [20:49:24] (logged few lines above) [20:49:35] (03Merged) 10jenkins-bot: EventDispatcher: Remove failing invariant check [extensions/DiscussionTools] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711720 (https://phabricator.wikimedia.org/T288775) (owner: 1020after4) [20:50:15] urbanecm: unrelated, you're good to go but thanks for the note! [20:50:29] good to know, thanks rzl [20:50:56] Aug 12 20:42:56 mwmaint2002 mcrouter[1931]: F0812 20:42:56.453008 1935 FiberManagerInternal-inl.h:539] Exception St11logic_error with message 'Some of ssl key paths are not set!' was thrown in FiberManager with context 'running Func functor' [20:51:23] I think this is from https://gerrit.wikimedia.org/r/705852/ but not sure why it took so long to appear [20:52:36] hmm, maybe it didn't fail until there was traffic? so conceivably the maint script might have been the triggering event even though it wasn't the cause [20:53:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:39] RECOVERY - mcrouter process on mwmaint2002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [20:54:13] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:23] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=testwiki --jobqueue # T288683 [20:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:31] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683 [20:54:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:08] and it crashed again -- we may have to roll that patch back [20:56:31] back [20:56:43] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=testwikidatawiki --jobqueue # T288683, errored out [20:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:50] abijeet: just in time. the script fataled for testwikidatawiki [20:56:57] https://www.irccloud.com/pastebin/ZdcWYj7G/ [20:57:01] tacsipacsi, the annoyance is understandable. sorry about the mess. [20:57:09] urbanecm, checking [20:57:39] urbanecm: fwiw, even though the script didn't cause the mcrouter failure, I don't know whether the mcrouter failure is going to cause problems for the script [20:57:58] rzl: does mcrouter failure mean the script cannot access memcached? [20:58:05] or does it mean something else? 
[20:58:23] yeah, memcached would have been either fully or intermittently unavailable [20:58:24] urbanecm, I would just run it again, we've noticed this on translatewiki as well. Reported at https://phabricator.wikimedia.org/T258860 [20:59:08] hmm, running wfMessage('june')->text() raises very similar error [20:59:16] so if you want to be sure, stand by while I debug this, and then I can let you know when that should be cleared up [20:59:25] PROBLEM - mcrouter process on mwmaint2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [20:59:35] so i guess it might be related [20:59:52] waiting for rzl to fix/debug the mcrouter bug [20:59:57] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:37] oh, yea, I believe this issue is related to memcache, but it happens for us at random on translatewiki.net as well [21:00:46] abijeet, it was not to blame anyone, we’re all humans. Thanks for fixing it! [21:01:00] thanks for confirming abijeet. I'll wait. [21:10:21] ok I'm ready to sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/711720 [21:10:28] everything clear? [21:10:54] No objections from my end.
[21:12:16] syncing [21:13:19] !log twentyafterfour@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/DiscussionTools/includes/Notifications/EventDispatcher.php: sync Ic27418a0ec976347be5fa586bbd32cc4a0d8d511 to unblock the train refs T288775 and T281159 (duration: 01m 07s) [21:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:29] T281159: 1.37.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T281159 [21:13:29] T288775: InvariantException: Invariant failed: Comments are always preceded by headings - https://phabricator.wikimedia.org/T288775 [21:20:48] Train is now unblocked. I'm ready to deploy wmf.18 to all wikis. [21:20:56] (03PS1) 10Ladsgroup: Don't generate HTML when asking for ParserOutput [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711721 (https://phabricator.wikimedia.org/T288639) [21:21:09] urbanecm, I'm planning to turn in in about 15 minutes, anything else I can help with? [21:21:31] urbanecm: just fyi, no update yet but still looking -- I'm trying to figure out why this is broken on mwmaint but working elsewhere [21:21:49] I might revert that change just because I'm suspicious of it, but ideally I'd like to get a better sense of what's going on first [21:22:03] abijeet: I don't think so -- unless the script breaks. Thanks a lot for your help. [21:22:35] rzl: should I hold off on the train deployment to all wikis or can that proceed without affecting your debugging? [21:22:50] urbanecm, I'll be around again in about 6 hours. 
[21:23:02] urbanecm, twentyafterfour, thanks for all your help and patience [21:23:09] any time abijeet [21:23:13] ttyl [21:23:14] abijeet: likewise, thanks for everything [21:23:17] twentyafterfour: thanks for asking, I think you won't affect me -- but in case it affects your deploy, be advised mwmaint2002 can't reliably reach mcrouter right now [21:23:22] *can't reliably reach memcached [21:23:38] rzl: ok thanks, I don't think it's an issue for me [21:23:45] RECOVERY - mcrouter process on mwmaint2002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [21:24:17] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:37] hi, once everything is okay, let me know and so I deploy this https://gerrit.wikimedia.org/r/711721 [21:27:18] Amir1: do you want to sync it before I run the train promote? [21:27:27] rather after [21:27:34] ok scapping now [21:27:39] otherwise, I need to backport it to wmf.17 too [21:27:58] (03PS1) 1020after4: all wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712626 [21:28:00] (03CR) 1020after4: [C: 03+2] all wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712626 (owner: 1020after4) [21:28:50] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712626 (owner: 1020after4) [21:30:30] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.18 refs T281159 [21:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:42] T281159: 1.37.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T281159 [21:31:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:21] Amir1: everything looks clear, go ahead and deploy your patch [21:32:28] thanks [21:32:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:01] (03CR) 10Ladsgroup: [C: 03+2] Don't generate HTML when asking for ParserOutput [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711721 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [21:44:24] twentyafterfour: I see some new errors in logspam but nothing major [21:44:34] 18 _____▃▄ 2130 2144 ● Error............... .18 i/s/SpecialWhatLinksHere:367 PHP Notice: Undefined offset: -1 [21:44:54] /s/SpecialWhatLinksHere:367 PHP Notice: Trying to get property 'page_id' of non-object [21:45:11] Amir1: yeah I am looking into that now [21:45:17] it's a pretty strange one [21:45:35] kinda looks like some bot fuzzing urls based on the request pattern [21:46:55] I can take a look at it but after I'm done with this deployment [21:49:20] it's definitely a result of wildly invalid user input [21:52:18] (03Merged) 10jenkins-bot: Don't generate HTML when asking for ParserOutput [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711721 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [21:52:46] !log Run `mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=$WIKI --jobqueue` for a bunch of Translate-enabled wikis (T288683) [21:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:54] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683 [21:57:16] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport:
[[gerrit:711721|Don't generate HTML when asking for ParserOutput (T288639)]] (duration: 00m 58s) [21:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:25] T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice - https://phabricator.wikimedia.org/T288639 [21:59:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) All the firmware has been updated and the mgmt password set [22:03:56] urbanecm: okay, I can't figure out what's going on here, that stack trace from mcrouter on mwmaint2002 doesn't make sense -- since it seems to be running okay now, I'm going to leave it as-is and file a task for effie to look at in the morning, but if we have trouble again in the meantime, we can try rolling back https://gerrit.wikimedia.org/r/705852 [22:04:12] or, correction, I'm sure it does make sense, but I can't yet figure out how :D [22:04:20] hehe [22:04:39] thanks for looking into it rzl. 
I started running the script again, and so far it doesn't run into issues [22:05:16] great [22:05:17] i'll leave it to the jobqueue to process spawned jobs now, and will continue later [22:07:05] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:09:53] !log T283867 running userOptions.php on Growth wikis as per T283867#7280296 [22:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:02] T283867: Maintenance script for changing user settings - https://phabricator.wikimedia.org/T283867 [22:17:49] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:22] ah man, there it goes again :( [22:18:30] okay, I'm going to roll back that patch and see if it helps [22:19:11] PROBLEM - mcrouter process on mwmaint2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [22:21:39] 10SRE, 10serviceops: mcrouter crashing on mwmaint2002 - https://phabricator.wikimedia.org/T288787 (10RLazarus) [22:22:29] (03PS1) 10RLazarus: Revert "mediawiki::mcrouter_wancache: disable ssl listening on mcrouter" [puppet] - 10https://gerrit.wikimedia.org/r/711722 (https://phabricator.wikimedia.org/T288787) [22:24:53] (03CR) 10RLazarus: [C: 03+2] Revert "mediawiki::mcrouter_wancache: disable ssl listening on mcrouter" [puppet] - 10https://gerrit.wikimedia.org/r/711722 (https://phabricator.wikimedia.org/T288787) (owner: 10RLazarus) [22:26:53] RECOVERY - mcrouter process on mwmaint2002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [22:27:25] RECOVERY - Check systemd state on mwmaint2002 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:02] tgr: FYI your script may not have been able to reach memcached from mwmaint2002 due to T288787; if you got errors maybe retry, it should be resolved [22:29:03] T288787: mcrouter crashing on mwmaint2002 - https://phabricator.wikimedia.org/T288787 [22:33:48] thanks rzl! I don't think it uses memcached [22:34:42] 👍 no action needed unless you had any problems [23:00:05] brennen: Your horoscope predicts another unfortunate US Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T2300). [23:00:05] tgr: A patch you scheduled for US Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:12] * thcipriani waves [23:00:36] * thcipriani sitting in for brennen [23:00:40] missing spaces ;D [23:01:15] typical jouncebot [23:01:39] tgr: doing backport training, mind if we deploy your patch? 
[23:04:49] thcipriani: on the contrary, thanks for doing it [23:05:00] great thanks :) [23:08:07] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:08:12] (03CR) 10Clare Ming: [C: 03+2] Add Link: fix invalidation on non-addlink edit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711719 (https://phabricator.wikimedia.org/T283606) (owner: 10Gergő Tisza) [23:24:48] (03PS1) 10Zabe: Set archive namespaces on foundationwiki to 'noindex,follow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712732 (https://phabricator.wikimedia.org/T288763) [23:24:51] apparently jouncebot is hard-coded to skip the "Max 6 patches" part in the event title. But it's not hard-coded to skip the "Your patch may or may not..." part so that gets squished into the announcement. [23:25:16] we should probably have separate template fields for the event name and notes. [23:26:31] (03Merged) 10jenkins-bot: Add Link: fix invalidation on non-addlink edit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711719 (https://phabricator.wikimedia.org/T283606) (owner: 10Gergő Tisza) [23:26:52] Is there also time to deploy a config patch from me? [23:29:03] zabe: depends on the config patch :) [23:29:17] link? [23:29:19] tgr: can you test on mwdebug2002? [23:29:35] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/712732 [23:30:36] zabe: sure we can do that one [23:31:46] looking [23:31:55] cool :) [23:33:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:39] cjming: the patch doesn't break anything. On mwdebug it doesn't fix the bug it should be fixing, but involves the job queue so I think that's expected.
[23:36:59] I'll test fully when it is live.
[23:37:19] so i'll go ahead and sync
[23:38:55] !log cjming@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/GrowthExperiments: Backport: [[gerrit:711719|Add Link: fix invalidation on non-addlink edit (T283606)]] (duration: 01m 00s)
[23:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:02] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606
[23:39:31] tgr: you're gtg
[23:40:30] zabe: you're up next :)
[23:40:41] (CR) D3r1ck01: [C: +2] Set archive namespaces on foundationwiki to 'noindex,follow' [mediawiki-config] - https://gerrit.wikimedia.org/r/712732 (https://phabricator.wikimedia.org/T288763) (owner: Zabe)
[23:41:13] thanks cjming! It's working as expected.
[23:41:25] \o/
[23:41:30] (Merged) jenkins-bot: Set archive namespaces on foundationwiki to 'noindex,follow' [mediawiki-config] - https://gerrit.wikimedia.org/r/712732 (https://phabricator.wikimedia.org/T288763) (owner: Zabe)
[23:42:26] (PS1) Zabe: Add extendedconfirmed on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322)
[23:43:35] zabe: it's now on mwdebug2002, please can you test it out? :)
[23:44:41] yes, doing
[23:45:05] zabe: awesome! :)
[23:45:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:55] xSavitar: looks good to me
[23:48:18] zabe: great! going forward to push live now.
[23:50:44] !log derick@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:712732|Set archive namespaces on foundationwiki to 'noindex,follow' (T288763)]] (duration: 00m 59s)
[23:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:51] T288763: Add Archive namespace on Governance Wiki to robots to block search indexing - https://phabricator.wikimedia.org/T288763
[23:50:58] zabe: code is now live
[23:52:27] thanks :)
[23:57:02] zabe: You're welcome! :)