[00:19:46] (PS1) Seddon: Investigate MediaSearch usability on other wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/697694 (https://phabricator.wikimedia.org/T278984)
[00:32:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[00:37:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[00:52:33] (PS1) Dzahn: add httpd site config and index.html for static-bugzilla [container/miscweb] - https://gerrit.wikimedia.org/r/697695 (https://phabricator.wikimedia.org/T281538)
[00:58:39] (CR) Dzahn: [C: +2] "blubber .pipeline/blubber.yaml staging | sudo docker build --tag nologging --file - . && sudo docker run --rm --interactive --tty nologgin" [container/miscweb] - https://gerrit.wikimedia.org/r/697695 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:00:03] (Merged) jenkins-bot: add httpd site config and index.html for static-bugzilla [container/miscweb] - https://gerrit.wikimedia.org/r/697695 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:07:01] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) ` git clone "https://gerrit.wikimedia.org/r/operations/container/miscweb" blubber .pipeline/blubber.yaml staging | sudo docker build --tag nologging --fil...
[01:26:11] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[01:26:51] well that explains why phab was slow
[01:27:26] twentyafterfour: you there? ^^
[01:27:33] 👀 👋
[01:28:07] bacula is taking up 75% of the CPU on phab1001
[01:28:19] or maybe 75% of one CPU
[01:42:27] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 37905 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[02:24:55] legoktm: hmm weird
[02:25:40] twentyafterfour: see _security
[02:26:26] Cannot join channel (+i) - you must be invited
[02:28:32] I never got invited after the chat network switch
[04:00:22] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:46] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: daily_account_consistency_check.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 80 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:31:11] (PS1) Marostegui: Revert "db1146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/697553
[04:31:30] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:32:33] (CR) Marostegui: [C: +2] Revert "db1146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/697553 (owner: Marostegui)
[04:56:10] !log Test
[04:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:27] !log restart tcpircbot-logmsgbot on alert1001 - T284123
[05:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:31] T284123: dbctl !log doesn't work - https://phabricator.wikimedia.org/T284123
[05:17:34] test message volans
[05:17:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P16250 and previous config saved to /var/cache/conftool/dbconfig/20210602-051738-root.json
[05:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:48] ok it seems back marostegui (see ^^^ )
[05:17:52] volans|off: ^
[05:17:56] \o/
[05:18:00] thanks!
[05:18:07] np :)
[05:18:30] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1183.eqiad.wmnet
[05:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3314', diff saved to https://phabricator.wikimedia.org/P16251 and previous config saved to /var/cache/conftool/dbconfig/20210602-051919-marostegui.json
[05:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:44] (PS1) Marostegui: db1144: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/697701
[05:22:19] (CR) Marostegui: [C: +2] db1144: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/697701 (owner: Marostegui)
[05:24:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-test site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:28:20] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1183.eqiad.wmnet
[05:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [REPLAY FROM 2021-06-01 19:29:26]
[05:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [REPLAY FROM 2021-06-01 19:42:38]
[05:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:32] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:697671|Fix pageterms API call for Special:Nearby in Wikidata (T281639)]] (duration: 00m 56s) [REPLAY FROM 2021-06-01 21:44:06]
[05:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:36] T281639: Special:Nearby not showing labels - https://phabricator.wikimedia.org/T281639
[05:32:42] !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[05:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P16245 and previous config saved to /var/cache/conftool/dbconfig/20210602-043227-root.json [REPLAY FROM 2021-06-02 04:32:27]
[05:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P16246 and previous config saved to /var/cache/conftool/dbconfig/20210602-044730-root.json [REPLAY FROM 2021-06-02 04:47:31]
[05:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2071', diff saved to https://phabricator.wikimedia.org/P16247 and previous config saved to /var/cache/conftool/dbconfig/20210602-045717-marostegui.json [REPLAY FROM 2021-06-02 04:57:17]
[05:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2071', diff saved to https://phabricator.wikimedia.org/P16248 and previous config saved to /var/cache/conftool/dbconfig/20210602-045736-marostegui.json [REPLAY FROM 2021-06-02 04:57:36]
[05:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:12] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P16249 and previous config saved to /var/cache/conftool/dbconfig/20210602-050234-root.json [REPLAY FROM 2021-06-02 05:02:34]
[05:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:01] ok I should be done replaying all !log messages
[05:41:53] thanks volans|off!
[05:43:38] SRE: tcpircbot-logmsgbot was not able to deliver messages - https://phabricator.wikimedia.org/T284123 (Volans) p:High→Medium
[05:54:07] (PS1) Razzi: Add dbstore1006 and dbstore1007 to analytics firewall [homer/public] - https://gerrit.wikimedia.org/r/697704
[05:55:16] (CR) Marostegui: Add dbstore1006 and dbstore1007 to analytics firewall (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/697704 (owner: Razzi)
[05:55:59] (PS2) Razzi: Add dbstore1006 and dbstore1007 to analytics firewall [homer/public] - https://gerrit.wikimedia.org/r/697704
[05:56:20] (PS3) Razzi: Add dbstore1007 to analytics firewall [homer/public] - https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125)
[05:57:09] (CR) Marostegui: "The IP looks good to me!" [homer/public] - https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[06:01:43] (CR) MSantos: [C: +1] "LGTM. One minor nit." (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/685799 (https://phabricator.wikimedia.org/T281976) (owner: Jgiannelos)
[06:02:56] (CR) MSantos: [C: +1] cassandra: drop support for 2.1 in metrics. Fix collector version [puppet] - https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: Hnowlan)
[06:04:26] (CR) Volans: [C: +1] "LGTM, just a couple of typos and a final nit. Consider it a +1 from me without re-reviewing the next patch." (16 comments) [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: Muehlenhoff)
[06:06:56] (PS1) Razzi: site: add dbstore1007, reimaged from db1183 [puppet] - https://gerrit.wikimedia.org/r/697705 (https://phabricator.wikimedia.org/T283125)
[06:12:30] (CR) Volans: "Sorry for the long delay, this slipped through the TODO... Is it still valid and in need of review? I left a couple of high level comments" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: Jbond)
[06:13:55] (CR) Marostegui: [C: +1] site: add dbstore1007, reimaged from db1183 [puppet] - https://gerrit.wikimedia.org/r/697705 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[06:17:12] (CR) Razzi: [C: +2] site: add dbstore1007, reimaged from db1183 [puppet] - https://gerrit.wikimedia.org/r/697705 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[06:34:41] !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[06:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:21] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:56] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: REIMAGE
[06:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:03] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: REIMAGE
[06:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:46] (PS1) Razzi: site: give mariadb role to dbstore1007 [puppet] - https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125)
[06:58:07] (CR) Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: Muehlenhoff)
[06:58:48] (PS10) Muehlenhoff: Cookbook to add a new node to a Ganeti cluster [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527)
[07:02:25] (CR) jerkins-bot: [V: -1] Cookbook to add a new node to a Ganeti cluster [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: Muehlenhoff)
[07:09:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:09:34] (PS11) Muehlenhoff: Cookbook to add a new node to a Ganeti cluster [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527)
[07:20:11] (PS12) Muehlenhoff: Cookbook to add a new node to a Ganeti cluster [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527)
[07:23:09] SRE, SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (JStephenson) Yes, fantastic. Thanks for all the support.
[07:23:50] (PS1) Muehlenhoff: Update MOU date for piccardi [puppet] - https://gerrit.wikimedia.org/r/697708
[07:24:59] (CR) Elukey: [C: -1] "Precautionary -1 since I am seeing some inconsistency between AAAA and A records." (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[07:27:48] SRE, Analytics, Analytics-Kanban, Product-Analytics, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (schoenbaechler) >>! In T283190#7119438, @Marostegui wrote: > @schoenbaechler can you please confirm you've read and s...
[07:28:22] (CR) Muehlenhoff: [C: +2] Update MOU date for piccardi [puppet] - https://gerrit.wikimedia.org/r/697708 (owner: Muehlenhoff)
[07:29:17] (CR) Muehlenhoff: [C: +2] Cookbook to add a new node to a Ganeti cluster [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: Muehlenhoff)
[07:31:20] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (ayounsi) Yep.
[07:32:14] (Merged) jenkins-bot: Cookbook to add a new node to a Ganeti cluster [cookbooks] - https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: Muehlenhoff)
[07:39:41] (CR) Elukey: "For db replication, I think that we should make sure that airflow users/grants are on all instances, the rest should happen without proble" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697653 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[07:44:49] SRE, ops-eqiad, Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (ema)
[07:44:53] !log installing squid security updates
[07:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:19] (CR) Marostegui: "Agreed Luca!" [homer/public] - https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[07:51:24] SRE, Wikimedia-Mailing-lists: Old mailing list info page URLs are 404s when listname is written with capital letter - https://phabricator.wikimedia.org/T284124 (Aklapper)
[08:12:42] (PS1) Ema: Netops team alert: ping offload [alerts] - https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806)
[08:12:52] !log removed eight inactive addresses from ops@ list
[08:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:53] SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (MoritzMuehlenhoff) The cookbook has been written, but I'm keeping the task open for now; I'll be doing some more finetuning when the capex for eqiad/codfw arrives.
[08:37:55] (CR) Muehlenhoff: "Looks great, one comment inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697625 (https://phabricator.wikimedia.org/T164456) (owner: Jbond)
[08:44:17] (PS1) David Caro: wmcs.toolschecker: update the tools etcd nodes [puppet] - https://gerrit.wikimedia.org/r/697714 (https://phabricator.wikimedia.org/T284131)
[08:45:47] (CR) Arturo Borrero Gonzalez: [C: +1] "Nice TODO right there :-)" [puppet] - https://gerrit.wikimedia.org/r/697714 (https://phabricator.wikimedia.org/T284131) (owner: David Caro)
[08:48:49] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/697605 (owner: Ottomata)
[08:49:25] (CR) Arturo Borrero Gonzalez: [C: +1] "Please collect +1 from others before merging." [software/tools-webservice] - https://gerrit.wikimedia.org/r/697102 (owner: Majavah)
[08:49:44] (CR) David Caro: [C: +2] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/697714 (https://phabricator.wikimedia.org/T284131) (owner: David Caro)
[08:54:34] (PS1) David Caro: ceph.mon: don't subscribe to ceph.conf [puppet] - https://gerrit.wikimedia.org/r/697715
[08:55:16] (CR) jerkins-bot: [V: -1] ceph.mon: don't subscribe to ceph.conf [puppet] - https://gerrit.wikimedia.org/r/697715 (owner: David Caro)
[09:01:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/697626 (https://phabricator.wikimedia.org/T164456) (owner: Jbond)
[09:02:47] (CR) Jbond: "I'll leave mine as a 0 as I think my -1's are the same as CI's 😊, other comments are just nits" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[09:03:19] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/697627 (https://phabricator.wikimedia.org/T164456) (owner: Jbond)
[09:10:03] SRE, Traffic, netops, User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (jbond) > I personally don't see Netbox as the right place for those, at least as prefixes Ack, I think in that case the best way forward for now is t...
[09:13:07] (CR) Jbond: [C: +1] "LGTM assuming all approvals" [puppet] - https://gerrit.wikimedia.org/r/697677 (https://phabricator.wikimedia.org/T283368) (owner: Cwhite)
[09:14:24] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/697679 (https://phabricator.wikimedia.org/T283189) (owner: Cwhite)
[09:14:38] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/Translate/scripts/moveTranslatablePage.php --wiki=metawiki --reason='OTRS -> VRTS renaming process; see [[Phab:T280392]] and [[Phab:T280396]] ([[:phab:T284118|request]])' 'OTRS' 'VRT' 'Quiddity (WMF)' # T284118
[09:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:46] T284118: Move translatable page OTRS - https://phabricator.wikimedia.org/T284118
[09:14:46] T280392: Migrate Wikimedia away from OTRS software and branding - https://phabricator.wikimedia.org/T280392
[09:14:47] T280396: Replace OTRS text on Meta - https://phabricator.wikimedia.org/T280396
[09:15:53] (CR) Jbond: [C: +1] "LGTM assuming approvals" [puppet] - https://gerrit.wikimedia.org/r/697682 (https://phabricator.wikimedia.org/T284109) (owner: Cwhite)
[09:16:34] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/697684 (https://phabricator.wikimedia.org/T283057) (owner: Cwhite)
[09:22:39] (PS2) Jbond: P:nginx: add an nginx profile [puppet] - https://gerrit.wikimedia.org/r/697625 (https://phabricator.wikimedia.org/T164456)
[09:23:02] (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697625 (https://phabricator.wikimedia.org/T164456) (owner: Jbond)
[09:38:15] SRE, Data-Persistence-Backup, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) Thanks for the info @jcrespo that should help. I did a lot of tests yesterday in relation to th...
[09:42:10] (CR) Jbond: "> Patch Set 4:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: Jbond)
[09:42:37] SRE, Patch-For-Review, User-jbond: Filter (if possible) downtimed hosts from check_puppet_run_changes.py's report - https://phabricator.wikimedia.org/T268211 (jbond)
[09:43:46] (PS1) Filippo Giunchedi: alertmanager: attach runbook/dashboard URLs to IRC messages [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806)
[09:43:47] (PS1) Filippo Giunchedi: alertmanager: add a sample JSON alert and instruction on how to test IRC format changes [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806)
[09:44:31] (CR) jerkins-bot: [V: -1] alertmanager: add a sample JSON alert and instruction on how to test IRC format changes [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[09:44:42] (PS17) Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088)
[09:44:59] (PS17) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[09:45:14] (PS18) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[09:45:51] (CR) jerkins-bot: [V: -1] (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[09:46:06] SRE, Patch-For-Review, User-jbond: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (jbond)
[09:48:37] SRE, Gerrit, LDAP-Access-Requests: Add dancy to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T283347 (hashar) Thank you :]
[09:48:53] SRE, SRE-Access-Requests: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (hnowlan)
[09:51:28] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbstore1006.eqiad.wmnet
[09:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:57] (PS2) Filippo Giunchedi: alertmanager: add a sample JSON alert and instruction on how to test IRC format changes [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806)
[09:58:11] (PS1) Jbond: sudo: drop keep_env option [puppet] - https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852)
[09:58:39] (CR) jerkins-bot: [V: -1] alertmanager: add a sample JSON alert and instruction on how to test IRC format changes [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[10:00:02] (PS3) Filippo Giunchedi: alertmanager: add a sample JSON alert and test instructions [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806)
[10:01:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbstore1006.eqiad.wmnet
[10:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:16] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:12:45] SRE, Traffic, netops, User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (ayounsi) I still think filtering public clouds on their ASN (with MaxMind DB) is the most sustainable path until T270618. Having to maintain multiple...
[10:16:38] (PS5) Jbond: cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211)
[10:17:31] SRE, SRE-Access-Requests: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (MSantos)
[10:17:31] !log kormat@cumin1001 START - Cookbook sre.dns.netbox
[10:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:18] !log Commit pfw policy 1622570851 to pfw3-codfw and pfw3-eqiad to support new host fran2001 (T282056)
[10:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:22] T282056: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056
[10:48:40] SRE, CAS-SSO, User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (jbond) I have created a package with[[ https://github.com/apereo/mod_auth_cas/pull/190 | SameSite Cookie support ]] and [[ https://github.com/apereo/mod_auth_cas/...
[10:52:42] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. Please collect +1 from Brooke before merging." [puppet] - https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:55:57] jouncebot: next
[10:55:57] In 0 hour(s) and 4 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210602T1100)
[10:58:57] Ready and reporting
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210602T1100).
[11:00:05] matej_suchanek and Seddon: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:07] o/
[11:00:09] I can deploy today
[11:00:11] hello Seddon :)
[11:00:21] Hey Martin!
[11:00:21] matej_suchanek: are you around?
[11:00:29] yes martin :)
[11:00:30] (CR) Urbanecm: [C: +2] Investigate MediaSearch usability on other wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/697694 (https://phabricator.wikimedia.org/T278984) (owner: Seddon)
[11:00:35] (CR) Urbanecm: [C: +2] InfoAction: Cast wgNamespaceProtection to array [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695322 (https://phabricator.wikimedia.org/T283751) (owner: Jforrester)
[11:00:54] great matej_suchanek.
[11:01:03] matej_suchanek: Seddon: I'll ping you both when patch is ready for testing.
[11:01:23] (Merged) jenkins-bot: Investigate MediaSearch usability on other wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/697694 (https://phabricator.wikimedia.org/T278984) (owner: Seddon)
[11:01:39] !log upload libapache2-mod-auth-cas_1.2-1+wmf11u1_amd64.deb - #T264605
[11:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:44] T264605: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605
[11:02:27] Seddon: your patch is available at mwdebug1001, can you test please?
[11:04:54] !log upload libapache2-mod-auth-cas_1.2-1 for buster and stretch - #T264605
[11:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:14] urbanecm: That change worked :)
[11:05:20] great, syncing
[11:06:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f12e368481b6836eefa070ad5dcf52af3f39d479: Investigate MediaSearch usability on other wikis (T278984) (duration: 00m 57s)
[11:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:37] T278984: [M] [SPIKE] Investigate how easy/hard it is to make MediaSearch usable on other wikis - https://phabricator.wikimedia.org/T278984
[11:06:39] Seddon: should be live
[11:08:33] !log update mod_auth_cas T264605
[11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:37] T264605: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605
[11:09:26] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/697625 (https://phabricator.wikimedia.org/T164456) (owner: Jbond)
[11:09:48] urbanecm: confirmed, it is live
[11:09:56] great. Anything else I can do for you Seddon ?
[11:10:11] urbanecm: A coffee would be great
[11:10:48] Unfortunately, you're a bit far for that :(
[11:22:46] (PS2) David Caro: ceph.mon: don't subscribe to ceph.conf [puppet] - https://gerrit.wikimedia.org/r/697715
[11:22:52] (CR) David Caro: [C: +2] wmcs.do_log_msg: Fixed to use the new correct port [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696505 (owner: David Caro)
[11:22:57] (CR) David Caro: [C: +2] cloudvirt.*: adding sal messages to all the cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696453 (owner: David Caro)
[11:23:03] (CR) David Caro: [C: +2] cloudvirt.{drain|safe_reboot}: use default control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696448 (owner: David Caro)
[11:23:06] (CR) David Caro: [C: +2] cloudvirt.*maintenante: use a default cloudcontrol node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696447 (owner: David Caro)
[11:23:10] (CR) David Caro: [C: +2] unset_maintenance: don't set downtime on icinga [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696446 (owner: David Caro)
[11:23:25] (Merged) jenkins-bot: InfoAction: Cast wgNamespaceProtection to array [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695322 (https://phabricator.wikimedia.org/T283751) (owner: Jforrester)
[11:24:15] matej_suchanek: your patch is at mwdebug1001. Can you test it please?
[11:25:09] urbanecm: yes, it is fixed
[11:25:23] great, syncing.
[11:25:53] (Merged) jenkins-bot: unset_maintenance: don't set downtime on icinga [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696446 (owner: David Caro)
[11:25:55] (Merged) jenkins-bot: cloudvirt.*maintenante: use a default cloudcontrol node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696447 (owner: David Caro)
[11:26:49] (Merged) jenkins-bot: cloudvirt.{drain|safe_reboot}: use default control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696448 (owner: David Caro)
[11:26:51] (Merged) jenkins-bot: cloudvirt.*: adding sal messages to all the cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696453 (owner: David Caro)
[11:26:53] (Merged) jenkins-bot: wmcs.do_log_msg: Fixed to use the new correct port [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/696505 (owner: David Caro)
[11:27:03] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/includes/actions/InfoAction.php: 85feaa15d9bbda130541adb6302f31c4372e6519: InfoAction: Cast wgNamespaceProtection to array (T283751) (duration: 01m 00s)
[11:27:06] matej_suchanek: here you go. Should be live! Anything else?
[11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:08] T283751: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T283751
[11:27:46] urbanecm: confirm, works in prod, nothing else, need to go
[11:28:05] excellent :). See you later then.
[11:28:22] bye [11:32:21] (03PS1) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [11:33:40] (03PS2) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [11:35:08] (03CR) 10jerkins-bot: [V: 04-1] P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [11:35:30] (03PS3) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [11:36:53] (03PS4) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [11:36:59] (03CR) 10jerkins-bot: [V: 04-1] P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [11:38:18] (03CR) 10jerkins-bot: [V: 04-1] P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [11:38:46] PROBLEM - MariaDB Replica Lag: s8 on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78091.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:39:00] PROBLEM - MariaDB Replica Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78105.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:39:00] PROBLEM - MariaDB Replica Lag: s8 on db2083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78105.33 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:39:28] PROBLEM - MariaDB Replica Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78132.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:39:42] marostegui or kormat ^downtime expired? [11:39:43] checking [11:39:45] yeah [11:39:46] PROBLEM - MariaDB Replica Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78150.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:39:47] most likely [11:40:06] yeah [11:40:07] it is expired [11:40:13] downtimed again [11:40:14] PROBLEM - MariaDB Replica Lag: s8 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78177.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:40:14] PROBLEM - MariaDB Replica Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 78178.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:40:35] will recover in a sec [11:40:38] (03PS5) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [11:41:16] RECOVERY - MariaDB Replica Lag: s8 on db2085 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:32] RECOVERY - MariaDB Replica Lag: s8 on db2081 is OK: OK slave_sql_lag Replication lag: 0.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:41] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29775/" [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [11:42:00] RECOVERY - MariaDB Replica Lag: s8 on db2082 is OK: OK slave_sql_lag Replication lag: 0.36 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:42:02] RECOVERY - MariaDB Replica Lag: s8 on db2080 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:42:22] RECOVERY - MariaDB Replica Lag: s8 on db2152 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:42:36] RECOVERY - MariaDB Replica Lag: s8 on db2086 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:42:36] RECOVERY - MariaDB Replica Lag: s8 on db2083 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:48:26] (03CR) 10Jbond: [C: 03+2] P:nginx: add an nginx profile [puppet] - 10https://gerrit.wikimedia.org/r/697625 (https://phabricator.wikimedia.org/T164456) (owner: 10Jbond) [11:48:34] (03PS2) 10Jbond: O:puppetmaster::puppetdb: add nginx profile to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/697626 (https://phabricator.wikimedia.org/T164456) [11:48:52] (03PS2) 10Jbond: O:puppetmatser::puppetdb: switch puppetdb to use nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/697627 (https://phabricator.wikimedia.org/T164456) [11:52:06] (03CR) 10Jbond: [C: 03+2] O:puppetmaster::puppetdb: add nginx profile to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/697626 (https://phabricator.wikimedia.org/T164456) (owner: 10Jbond) [11:54:26] !log disable puppet fleet wide. 
changing puppetdb to use nginx-light #T164456 [11:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:32] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [11:56:04] (03CR) 10Jbond: [C: 03+2] O:puppetmatser::puppetdb: switch puppetdb to use nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/697627 (https://phabricator.wikimedia.org/T164456) (owner: 10Jbond) [11:57:15] (03PS1) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [11:58:54] (03CR) 10jerkins-bot: [V: 04-1] Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [12:02:49] (03PS1) 10Muehlenhoff: Create debmonitor user on buster with adduser [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/697734 (https://phabricator.wikimedia.org/T256098) [12:04:19] (03PS1) 10Jbond: admin - jbond: update . files [puppet] - 10https://gerrit.wikimedia.org/r/697735 [12:05:20] (03PS1) 10Kormat: db1125: Rename back from dbstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/697736 (https://phabricator.wikimedia.org/T284128) [12:05:49] !log enable puppet fleet wide. post changing puppetdb to use nginx-light #T164456 [12:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:55] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [12:06:56] (03CR) 10Marostegui: [C: 03+1] "thanks :*" [puppet] - 10https://gerrit.wikimedia.org/r/697736 (https://phabricator.wikimedia.org/T284128) (owner: 10Kormat) [12:06:58] (03CR) 10Jbond: [C: 03+2] admin - jbond: update . 
files [puppet] - 10https://gerrit.wikimedia.org/r/697735 (owner: 10Jbond) [12:07:02] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29776/console" [puppet] - 10https://gerrit.wikimedia.org/r/697736 (https://phabricator.wikimedia.org/T284128) (owner: 10Kormat) [12:11:14] 10SRE, 10Traffic, 10User-ArielGlenn, 10User-MoritzMuehlenhoff, 10User-jbond: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10jbond) I have updated puppetdb to use nginx-light. further It should now be fairly simple to switch other services to nginx-light. 1. Add `profile::nginx` [... [12:12:42] (03PS2) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [12:13:59] (03CR) 10Kormat: [V: 03+1 C: 03+2] db1125: Rename back from dbstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/697736 (https://phabricator.wikimedia.org/T284128) (owner: 10Kormat) [12:14:21] (03CR) 10jerkins-bot: [V: 04-1] Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [12:15:33] (03PS1) 10Filippo Giunchedi: alerts: reload prometheus instances after deploy [puppet] - 10https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806) [12:23:37] (03PS1) 10Muehlenhoff: Add nginx profile to apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/697739 (https://phabricator.wikimedia.org/T164456) [12:23:39] (03PS1) 10Muehlenhoff: Switch apt* to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/697740 (https://phabricator.wikimedia.org/T164456) [12:24:25] (03CR) 10jerkins-bot: [V: 04-1] Switch apt* to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/697740 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:24:48] 10SRE, 10Continuous-Integration-Infrastructure, 
10SRE-Access-Requests, 10Patch-For-Review: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10thcipriani) >>! In T283925#7124935, @Marostegui wrote: > I think it needs approval from @greg or @thcipriani Approved... [12:27:41] 10SRE, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10Marostegui) [12:28:14] (03PS2) 10Marostegui: data.yaml: Add Ladsgroup to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/697614 (https://phabricator.wikimedia.org/T283925) [12:29:15] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add Ladsgroup to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/697614 (https://phabricator.wikimedia.org/T283925) (owner: 10Marostegui) [12:29:48] (03PS13) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) [12:30:06] 10SRE, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10Marostegui) 05Open→03Resolved Patch merged, please give it sometime for puppet to run everywhere. [12:32:20] 10SRE, 10SRE-Access-Requests: Requesting access to production deployment for David Lynch - https://phabricator.wikimedia.org/T283607 (10LSobanski) I updated the Wikitech instructions page to clarify the requirements (diff here: https://wikitech.wikimedia.org/w/index.php?title=Creating_new_tables&type=revision&... 
[12:33:21] (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (0328 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [12:33:52] there might be alerts for ms-be106* of / space available, that's me [12:37:41] (03PS1) 10Kormat: install_server: Temporarily set db1125 to destructive install. [puppet] - 10https://gerrit.wikimedia.org/r/697779 (https://phabricator.wikimedia.org/T284128) [12:39:16] (03CR) 10Kormat: [C: 03+2] install_server: Temporarily set db1125 to destructive install. [puppet] - 10https://gerrit.wikimedia.org/r/697779 (https://phabricator.wikimedia.org/T284128) (owner: 10Kormat) [12:40:41] (03PS12) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [12:41:52] (03CR) 10Jbond: [C: 03+2] role:mx: add script to generate otrs aliases (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:42:36] (03CR) 10jerkins-bot: [V: 04-1] role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:43:25] 10SRE, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10Ladsgroup) Thank you <3 [12:46:35] (03PS1) 10Jbond: P:mail::mx: add default gsuit mx [puppet] - 10https://gerrit.wikimedia.org/r/697781 [12:47:13] (03CR) 10Jbond: [C: 03+2] P:mail::mx: add default gsuit mx [puppet] - 10https://gerrit.wikimedia.org/r/697781 (owner: 10Jbond) [12:49:17] (03PS1) 10Jbond: P:mail::mx: fix template location [puppet] - 10https://gerrit.wikimedia.org/r/697782 [12:51:48] (03CR) 10Jbond: [C: 03+2] P:mail::mx: fix template location [puppet] - 
10https://gerrit.wikimedia.org/r/697782 (owner: 10Jbond) [12:52:46] (03PS2) 10Muehlenhoff: Add nginx profile to apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/697739 (https://phabricator.wikimedia.org/T164456) [12:54:02] (03CR) 10Ladsgroup: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [12:55:20] (03PS6) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [12:55:54] (03CR) 10jerkins-bot: [V: 04-1] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [12:58:02] (03PS7) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [12:58:35] (03CR) 10jerkins-bot: [V: 04-1] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [12:59:17] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:36] ^ this is me will fix [13:02:15] (03PS13) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [13:03:43] (03CR) 10jerkins-bot: [V: 04-1] role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [13:07:40] (03PS1) 10Jbond: P:mail::mx: update systemd timer so that it fails silently [puppet] - 10https://gerrit.wikimedia.org/r/697785 (https://phabricator.wikimedia.org/T284145) [13:08:18] I'm running a sequence of parse commands against the alswiki API, but it randomly just hangs up at some point and doesn't respond. 
[13:08:50] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:51] I'm trying to figure out if this is an issue with curl, or if I'm hitting a bad endpoint at times. [13:12:28] (03CR) 10Jbond: [C: 03+2] P:mail::mx: update systemd timer so that it fails silently [puppet] - 10https://gerrit.wikimedia.org/r/697785 (https://phabricator.wikimedia.org/T284145) (owner: 10Jbond) [13:13:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) [13:14:12] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) [13:14:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1125.eqiad.wmnet with reason: REIMAGE [13:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:32] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:01] (03PS1) 10Vlad.shapik: Replace uses of AbstractBlock::getTarget() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697786 (https://phabricator.wikimedia.org/T284141) [13:16:02] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:22] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1125.eqiad.wmnet with reason: REIMAGE [13:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:06] (03CR) 10Vlad.shapik: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697786 (https://phabricator.wikimedia.org/T284141) (owner: 10Vlad.shapik) 
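Note on the 13:08 question about parse requests to the alswiki API hanging: one way to tell a client-side stall from a slow or unresponsive endpoint is to give every request an explicit timeout and record which attempt stalled. The following is a minimal sketch, not the method used in the channel. The endpoint URL, the helper names `build_parse_url` and `fetch_with_timeout`, and the retry count are assumptions made for illustration.

```python
import socket
import time
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoint for alswiki; adjust if the real target differs.
API_URL = "https://als.wikipedia.org/w/api.php"


def build_parse_url(title):
    """Build an action=parse request URL for one page title."""
    query = urllib.parse.urlencode(
        {"action": "parse", "page": title, "format": "json"}
    )
    return f"{API_URL}?{query}"


def fetch_with_timeout(url, timeout=10.0, attempts=3,
                       opener=urllib.request.urlopen):
    """Fetch url with a hard timeout, retrying on timeouts and URL errors.

    Returns (body, elapsed_seconds, attempt_number). Raises the last error
    if every attempt fails, so a hang turns into a visible failure.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        started = time.monotonic()
        try:
            with opener(url, timeout=timeout) as response:
                return response.read(), time.monotonic() - started, attempt
        except (socket.timeout, urllib.error.URLError) as error:
            last_error = error
            elapsed = time.monotonic() - started
            # Logging the elapsed time shows whether the request stalled
            # for the full timeout or failed quickly.
            print(f"attempt {attempt} failed after {elapsed:.1f}s: {error!r}")
    raise last_error
```

If the same title stalls on every attempt, the endpoint or the page is the more likely cause; if stalls move around between titles, the client or the network path is worth a closer look.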
[13:21:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [13:25:00] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (10Ottomata) Approved! no ssh or kerberos needed. [13:26:52] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (10Ottomata) Hi! Approved! It'd be nice if the ticket explained the reason for the access request, not just the services wanted. @Maryana or @Kgordon c... [13:27:14] (03PS1) 10Kormat: Revert "install_server: Temporarily set db1125 to destructive install." [puppet] - 10https://gerrit.wikimedia.org/r/697806 [13:27:24] (03CR) 10Muehlenhoff: [C: 03+1] "Didn't check the module itself, but looks totally sane to import" [puppet] - 10https://gerrit.wikimedia.org/r/696380 (owner: 10Jbond) [13:29:01] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Temporarily set db1125 to destructive install." [puppet] - 10https://gerrit.wikimedia.org/r/697806 (owner: 10Kormat) [13:32:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/697734 (https://phabricator.wikimedia.org/T256098) (owner: 10Muehlenhoff) [13:40:24] (03CR) 10Kormat: [C: 03+1] "One comment, but +1 to the change in general." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) (owner: 10Jbond) [13:48:29] (03PS2) 10Ema: Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) [13:51:38] (03PS2) 10Jbond: sudo: drop keep_env option [puppet] - 10https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) [13:52:07] (03CR) 10Jbond: sudo: drop keep_env option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) (owner: 10Jbond) [13:54:18] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) On the Raritan PDU all the power cords fits well. the PDU also mount well on our rack. The only problem is, the plugs are on 2 rows and not on single r... [13:56:40] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (10cmadeo) Thanks @ottomata + @colewhite! Do I still need approval from @lucyblackwell ? [13:59:29] (03PS5) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 [13:59:34] 10SRE, 10Analytics-Radar, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Bumeh-ctr) Successfully logged in to Superset! 
[13:59:39] (03PS18) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) [14:03:18] 10SRE, 10SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (10colewhite) 05Open→03Resolved [14:04:01] (03CR) 10Ayounsi: [C: 04-1] "I worry we're moving alerts from Icinga to Alert Manager while the user experience of https://alerts.wikimedia.org/ is far behind https://" (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:09:57] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:48] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) TICKET NO. 1979021 was create to request to plug the Raritan PDU in Rack D8. [14:12:42] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (10lucyblackwell) I Lucy Blackwell, Carolyn’s manager approve this! [14:16:49] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (10Kgordon) [14:17:09] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (10Kgordon) @Ottomata, done! Thank you very much! 
[14:17:25] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:22] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) [14:31:00] (03PS3) 10Muehlenhoff: Add nginx profile to apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/697739 (https://phabricator.wikimedia.org/T164456) [14:32:38] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) [14:33:19] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) 05Open→03Resolved [14:44:57] (03CR) 10Filippo Giunchedi: Netops team alert: ping offload (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:49:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/697739 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:50:09] (03CR) 10Filippo Giunchedi: "> Patch Set 2: Code-Review-1" (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:51:28] (03Abandoned) 10Muehlenhoff: Switch apt* to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/697740 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:56:55] 10SRE, 10observability, 10CAS-SSO, 10User-jbond: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10fgiunchedi) Still seeing CORS errors from grafana-rw when left idle for a few hours, Chrome's console: {F34478522} And the network tab: {F34478524} [15:03:50] Jbond [15:04:03] (jbond) [15:12:46] 
(jbond) [15:13:02] Jbond [15:13:08] Jbond [15:15:02] sorry marostegui was me testing something [15:15:15] jbond: :), you can come back, I didn't ban, just a kick [15:15:51] marostegui: thanks ill test in a different room to reduce the noise :) [15:27:13] (03PS1) 10Muehlenhoff: Switch ncredir to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697799 [15:28:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/697799 (owner: 10Muehlenhoff) [15:29:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10dcaro) [15:31:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10dcaro) [15:32:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10dcaro) Machine out of service and marked as failed in netbox, feel free to take it out/debug/troubleshoot it :) [15:33:10] (03CR) 10Effie Mouzeli: "> Patch Set 2:" (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [15:33:53] (03CR) 10Effie Mouzeli: "Sorry for the drive by comment, I accidentally run into this:)" [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [15:41:47] (03PS1) 10Ottomata: Hive log4j -= Use PidDailyRollingFileAppender for CLI sessions [puppet] - 10https://gerrit.wikimedia.org/r/697802 (https://phabricator.wikimedia.org/T283126) [15:43:33] (03CR) 10Ottomata: [C: 03+2] Hive log4j -= Use PidDailyRollingFileAppender for CLI sessions [puppet] - 10https://gerrit.wikimedia.org/r/697802 (https://phabricator.wikimedia.org/T283126) (owner: 10Ottomata) 
[15:47:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:00] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (10Htriedman) @JBennett tagging you to flag that you need to sign off on this Re: Contract expiration, it's set to expire at t... [15:52:48] (03CR) 10Kadirselcuk: [C: 03+1] (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [15:56:46] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [15:59:46] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (10JBennett) Approved from my end. 
[15:59:47] !log sukhe@cumin1001 START - Cookbook sre.hosts.decommission for hosts cescout1001.eqiad.wmnet [15:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:44] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:35] (03PS1) 10Ottomata: Remove more python packages from stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/697804 (https://phabricator.wikimedia.org/T275786) [16:10:34] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cescout1001.eqiad.wmnet [16:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:39] 10SRE, 10Traffic, 10decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `cescout1001.eqiad.wmnet` - cescout1001.eqiad.wmnet (**PASS**) - Downtimed host on Icing... [16:13:30] (03PS1) 10Ladsgroup: mailman: force lowercase in checking for redirects in apache [puppet] - 10https://gerrit.wikimedia.org/r/697805 (https://phabricator.wikimedia.org/T284124) [16:19:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:10] !bash I haven't tested it yet but I found it in the internet so it must be true. (from https://gerrit.wikimedia.org/r/c/operations/puppet/+/697805/) [16:20:11] legoktm: Stored quip at https://bash.toolforge.org/quip/odGGzXkB1jz_IcWucuEx [16:20:45] :D Funnily enough, it doesn't work [16:20:47] debugging [16:21:25] (03CR) 10Marostegui: "Razzi, can you rebase locally and merge it? 
Even if we cannot work on the host itself this week due to dbstore1004 being used at the momen" [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [16:21:28] (03PS1) 10Ssingh: site: decommission cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/697828 (https://phabricator.wikimedia.org/T275696) [16:21:31] Everything you read on the internet is true [16:23:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:10] https://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewritecond doesn't have anything for case adjusting [16:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:05] Amir1: try https://stackoverflow.com/questions/39273477/case-insensitive-file-check-for-rewritecond ? [16:24:53] oh thanks [16:25:57] (03PS2) 10Razzi: site: give mariadb role to dbstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) [16:26:42] (03CR) 10Marostegui: [C: 03+1] site: give mariadb role to dbstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [16:26:55] (03CR) 10Marostegui: [C: 03+1] "This still needs the FW to be opened on the other patch, right?" [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [16:26:57] (03CR) 10Ssingh: [C: 03+2] site: decommission cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/697828 (https://phabricator.wikimedia.org/T275696) (owner: 10Ssingh) [16:27:29] no [16:27:30] sigh [16:27:48] oh it needs $ twice [16:27:53] something is greedy [16:28:02] two dollars? [16:28:06] very greedy [16:28:56] didn't work, probably needs more dollars [16:29:11] Are you going to do a Wolf of Wall Street? 
[16:29:30] wait, it had caching [16:29:33] (03CR) 10Razzi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [16:29:52] I think we insulted apache2 enough so it stopped cooperating [16:29:52] (03PS3) 10Razzi: site: give mariadb role to dbstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) [16:30:06] $$$$$$$$$ Work you damnation engine! [16:30:11] Problem solved [16:30:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/697739 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [16:30:54] 10SRE, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10ssingh) Puppet roles/profiles/hiera configurations have not been removed as they will be needed for the cescout VM that will be provisioned later. [16:33:28] (03PS2) 10Ladsgroup: mailman: force lowercase in checking for redirects in apache [puppet] - 10https://gerrit.wikimedia.org/r/697805 (https://phabricator.wikimedia.org/T284124) [16:34:09] Amir1: are all the lists in mm3 all lowercase? [16:34:50] legoktm: AFAIK, I read somewhere in documentation that it's forced [16:35:37] I'll just create the WIKITECH-L list as a test! 
if it succeeds we can use it for "wikitech-l but only shouting" [16:36:30] IT'S WHERE ALL THE IMPORTANT DISCUSSIONS HAPPEN [16:36:52] SHOUT ONLY IF SOMEONE KILLED WIKIMEDIA AGAIN [16:38:21] I checked, they're all lowercase [16:38:36] https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/docs/NEWS.html?highlight=lower#id58 [16:39:14] https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/model/docs/addresses.html?highlight=lower#case-preserved-addresses [16:39:33] https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/addresses.html?highlight=lower#addresses [16:39:39] These three seem related [16:42:47] 10SRE, 10Data-Persistence-Backup, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) [16:46:19] (03CR) 10Legoktm: [C: 03+2] mailman: force lowercase in checking for redirects in apache [puppet] - 10https://gerrit.wikimedia.org/r/697805 (https://phabricator.wikimedia.org/T284124) (owner: 10Ladsgroup) [16:47:44] https://lists.wikimedia.org/mailman/listinfo/WiKiTeCh-L [16:48:24] Wikitech-l works fine though [16:52:37] works for me [16:52:39] let's close it [16:53:00] the above link didn't work a few minutes ago, but it works now [16:53:05] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-Ladsgroup: Old mailing list info page URLs are 404s when listname is written with capital letter - https://phabricator.wikimedia.org/T284124 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Works now. [16:53:07] (for me) [16:53:38] huh [16:53:41] it works now too [16:53:57] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-Ladsgroup: Old mailing list info page URLs are 404s when listname is written with capital letter - https://phabricator.wikimedia.org/T284124 (10Legoktm) Wikitech-l now works, though https://lists.wikimedia.org/mailman/listinfo/WiKiTeCh-L does not... 
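Note on the redirect fix discussed from 16:13 onward: the idea is to lowercase the list name from an old listinfo URL before checking whether a redirect applies, since the lists were confirmed to be all lowercase. The following sketch shows only that normalization step in Python. It is not the Apache configuration from the patch; the `KNOWN_LISTS` set and the function name are hypothetical.

```python
import re

# Hypothetical set of list names that exist on the new install.
KNOWN_LISTS = {"wikitech-l"}

# Matches old-style listinfo paths such as /mailman/listinfo/WiKiTeCh-L
LISTINFO_RE = re.compile(r"^/mailman/listinfo/([^/]+)/?$")


def normalize_listinfo_path(path):
    """Return the lowercased list name for a known list, else None."""
    match = LISTINFO_RE.match(path)
    if match is None:
        return None
    # Lowercase before the lookup so capitalisation in old links
    # does not cause a miss.
    name = match.group(1).lower()
    return name if name in KNOWN_LISTS else None
```

With this check, `/mailman/listinfo/Wikitech-l` and `/mailman/listinfo/WiKiTeCh-L` both resolve to the same list name, which is the behaviour the patch was aiming for.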
[16:54:22] it works in FF for me but curl 404s [16:54:57] I restarted apache (not just a reload) and now it works consistently [16:55:17] !log restarted apache2 on lists1001 for https://gerrit.wikimedia.org/r/697805 [16:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:43] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-Ladsgroup: Old mailing list info page URLs are 404s when listname is written with capital letter - https://phabricator.wikimedia.org/T284124 (10Legoktm) >>! In T284124#7129657, @Legoktm wrote: > Wikitech-l now works, though https://lists.wikimedi... [17:11:43] (03PS1) 10Ryan Kemper: reimage: raid0.default_layout=2 for all installers [puppet] - 10https://gerrit.wikimedia.org/r/697832 (https://phabricator.wikimedia.org/T274788) [17:12:09] (03CR) 10Marostegui: [C: 03+2] site: give mariadb role to dbstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [17:12:18] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:14:19] (03PS2) 10Ryan Kemper: reimage: raid0.default_layout=2 for all installers [puppet] - 10https://gerrit.wikimedia.org/r/697832 (https://phabricator.wikimedia.org/T274788) [17:15:46] (03PS1) 10Marostegui: site.pp: Fix dbstore1007 role [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) [17:15:47] razzi: ^ [17:16:33] (03CR) 10Razzi: [C: 03+1] site.pp: Fix dbstore1007 role [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:16:42] (03CR) 10Marostegui: [C: 03+2] site.pp: Fix dbstore1007 role [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:16:53] (03CR) 10Ryan Kemper: "- Appends raid0.default_layout=2 to all installers" 
[puppet] - 10https://gerrit.wikimedia.org/r/697832 (https://phabricator.wikimedia.org/T274788) (owner: 10Ryan Kemper) [17:23:22] (03CR) 10Elukey: site.pp: Fix dbstore1007 role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:24:00] (03CR) 10Marostegui: "yes, it was a quick fix to stop puppet for breaking in dbstore1007 😊" [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:25:16] (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:25:45] (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:27:36] (03CR) 10Kadirselcuk: [C: 03+1] site.pp: Fix dbstore1007 role [puppet] - 10https://gerrit.wikimedia.org/r/697833 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [17:28:38] (03CR) 10Kadirselcuk: [C: 03+1] site: give mariadb role to dbstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/697706 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [17:29:01] razzi: o/ if you have a min can you remove dbstore1006 from the site.pp regex? (low priority, whenever you have time) [17:32:29] (03PS1) 10Razzi: site: remove dbstore1006 from mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/697834 (https://phabricator.wikimedia.org/T283125) [17:32:43] elukey: like that ^ ? [17:33:02] !log disabled Kadirselcuk gerrit account, +1 spam (and blocked elsewhere) [17:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:42] razzi: yes it should work! 
Just as precaution, run pcc for all the dbstores (I'd have just added [3457]) [17:41:42] (03CR) 10Elukey: [C: 03+1] "Ran ppc https://puppet-compiler.wmflabs.org/compiler1002/29783/" [puppet] - 10https://gerrit.wikimedia.org/r/697834 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [17:47:11] (03PS1) 10Ebernhardson: mjolnir: Stop listening to BC topics [puppet] - 10https://gerrit.wikimedia.org/r/697836 (https://phabricator.wikimedia.org/T261407) [17:48:01] (03CR) 10Ebernhardson: [C: 04-1] "Topics currently in use, once other patches have gone this will come last." [puppet] - 10https://gerrit.wikimedia.org/r/697836 (https://phabricator.wikimedia.org/T261407) (owner: 10Ebernhardson) [17:51:33] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697837 (https://phabricator.wikimedia.org/T284169) [17:51:35] (03PS1) 10Bartosz Dziewoński: Make DiscussionTools' replytool available for everyone on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697838 (https://phabricator.wikimedia.org/T283119) [17:53:00] jouncebot: refresh [17:53:00] I refreshed my knowledge about deployments. [18:00:05] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210602T1800). [18:00:05] MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:24] i can deploy today [18:00:28] hello [18:00:44] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' topicsubscription as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697837 (https://phabricator.wikimedia.org/T284169) (owner: 10Bartosz Dziewoński) [18:00:58] urbanecm: can you confirm that the 'discussiontools_subscription' table exists on the beta cluster wikis? [18:00:59] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools' replytool available for everyone on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697838 (https://phabricator.wikimedia.org/T283119) (owner: 10Bartosz Dziewoński) [18:01:09] i'm pretty sure the schema updates happen automatically [18:01:12] but not completely sure [18:01:23] MatmaRex: as long as it gets created after update.php, it is there [18:01:29] (03Merged) 10jenkins-bot: Enable DiscussionTools' topicsubscription as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697837 (https://phabricator.wikimedia.org/T284169) (owner: 10Bartosz Dziewoński) [18:01:33] yes. thanks [18:01:43] (03Merged) 10jenkins-bot: Make DiscussionTools' replytool available for everyone on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697838 (https://phabricator.wikimedia.org/T283119) (owner: 10Bartosz Dziewoński) [18:02:59] but it is there, just checked https://www.irccloud.com/pastebin/XvFvfaih/ [18:03:43] (03CR) 10Ottomata: [C: 03+2] Remove more python packages from stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/697804 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [18:03:45] fyi: I'm going to sync out the wikitech patch, as wikitech doesn't support mwdebug [18:05:12] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4bf76fc09bc06f76ce842d42b77fe6b036943b69: Make DiscussionTools replytool available for everyone on wikitech (T283119) (duration: 00m 58s) [18:05:15] MatmaRex: should be live. 
Can you make sure it works and lmk? [18:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:17] T283119: Offer the Reply and New Discussion Tools at Wikitech - https://phabricator.wikimedia.org/T283119 [18:05:23] yeah [18:06:08] (03CR) 10Ebernhardson: "I've manually run the swift_upload.py script from analytics to produce test events to the referenced topics, and then verified with kafkac" [puppet] - 10https://gerrit.wikimedia.org/r/693205 (https://phabricator.wikimedia.org/T261407) (owner: 10Ebernhardson) [18:06:47] urbanecm: wikitech looks good [18:06:51] great [18:06:59] the beta one will roll out automatically...soon [18:07:00] and beta happens automatically in a few minutes, i assume [18:07:02] yeah. thanks [18:07:02] yeah [18:07:26] assuming we're done, I'm going to do a quick security deploy [18:08:02] thanks, i have nothing else [18:10:09] any time :) [18:11:25] !log Deployed security patch for T281972 [18:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:17] (03CR) 10Ryan Kemper: [C: 03+2] mjolnir bulk daemon: Add topic for hourly updates [puppet] - 10https://gerrit.wikimedia.org/r/693205 (https://phabricator.wikimedia.org/T261407) (owner: 10Ebernhardson) [18:17:28] (03Abandoned) 10Ryan Kemper: wdqs: hack issue blocking reimage on some hosts [puppet] - 10https://gerrit.wikimedia.org/r/689525 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [18:18:44] (03PS3) 10Ottomata: Set up airflow-analytics on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/697653 (https://phabricator.wikimedia.org/T272973) [18:18:46] (03PS1) 10Ottomata: airflow-analytics-test - set dags folder to /srv/airflow-analytics-test-dags [puppet] - 10https://gerrit.wikimedia.org/r/697841 (https://phabricator.wikimedia.org/T272973) [18:20:44] (03PS2) 10Ottomata: airflow-analytics-test - set dags folder to /srv/airflow-analytics-test-dags [puppet] - 10https://gerrit.wikimedia.org/r/697841 
(https://phabricator.wikimedia.org/T272973) [18:28:14] (03CR) 10Ottomata: [C: 03+2] airflow-analytics-test - set dags folder to /srv/airflow-analytics-test-dags [puppet] - 10https://gerrit.wikimedia.org/r/697841 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [18:31:34] (03PS3) 10Urbanecm: Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460) [18:31:37] (03CR) 10Urbanecm: [C: 03+2] Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460) (owner: 10Urbanecm) [18:32:23] (03Merged) 10jenkins-bot: Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460) (owner: 10Urbanecm) [18:37:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e9c981d5173b1d611458f6c70b34d73476b7bbde: Revert "enwiktionary: Raise AF emergency disable treshold+count" (T283460) (duration: 00m 58s) [18:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:21] * urbanecm done [18:39:02] (03PS1) 10Dzahn: gitlab: add Bacula backup::host class to role [puppet] - 10https://gerrit.wikimedia.org/r/697844 (https://phabricator.wikimedia.org/T274463) [18:48:53] (03CR) 10Wolfgang Kandek: [C: 03+1] gitlab: add Bacula backup::host class to role [puppet] - 10https://gerrit.wikimedia.org/r/697844 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [18:50:09] (03PS13) 10Ottomata: Initial debianization and 2.1.0-py3.7-1 release [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012) [18:50:35] (03CR) 10Ottomata: [C: 03+2] "Merging, please review if you get a chance :)" [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/693222 
(https://phabricator.wikimedia.org/T277012) (owner: 10Ottomata) [18:50:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Initial debianization and 2.1.0-py3.7-1 release [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012) (owner: 10Ottomata) [18:56:23] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/29784/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/697844 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [18:57:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gitlab: add Bacula backup::host class to role [puppet] - 10https://gerrit.wikimedia.org/r/697844 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [18:57:16] (03PS2) 10Dzahn: gitlab: add Bacula backup::host class to role [puppet] - 10https://gerrit.wikimedia.org/r/697844 (https://phabricator.wikimedia.org/T274463) [19:04:52] could someone share with me the credentials for beta cluster logstash? https://wikitech.wikimedia.org/wiki/Logstash#Beta_Cluster_Logstash i don't think i have access to that server holding them [19:05:58] MatmaRex: i don't think betacluster logstash currently even gets logs, it's kind of a mess [19:06:34] MatmaRex: i can, but AFAICS, it's broken anyway [19:07:14] hmm, okay then, i was hoping it worked :D [19:07:54] https://logstash-beta.wmflabs.org/ says "Service Unavailable" "The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later." [19:07:57] yeah [19:08:43] you're all welcome to fix it (or the updated elk7 setup that I didn't manage to get working) [19:08:47] MatmaRex: deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud has TXT logs. 
I can give you shell there if you want [19:09:45] eh, no need, it's probably easier to just try things and see if they fix my bug, instead of trying to get the logs [19:09:51] i wanted to debug https://phabricator.wikimedia.org/T284175 [19:09:57] thanks [19:10:01] okay [19:10:24] MatmaRex: if you need any logs, feel free to ping me, i'm happy to get them from mwlog [19:10:48] not sure which kind of logs you'd need [19:12:27] me neither :D [19:15:44] (03PS1) 10Dzahn: bacula/gitlab: add a backup::set for gitlab and use it [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) [19:21:08] (03CR) 10Dzahn: "So yea, basically request for review here is to confirm whether /srv/gitlab-backup is correct and the only place that needs backups. Path " [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:24:03] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Jclark-ctr) [19:28:50] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Jclark-ctr) frdev1002 c1 u27 port19. 
id#4036/1941 [19:28:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Jclark-ctr) [19:29:24] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:29:54] 10SRE, 10Platform Engineering (Icebox), 10User-Eevans: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10Eevans) a:05Eevans→03None [19:32:07] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Jclark-ctr) [19:34:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Jclark-ctr) copernicium B4 U42 Port Cableid#5368 [19:35:39] (03CR) 10Razzi: [C: 03+2] site: remove dbstore1006 from mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/697834 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [19:36:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Jclark-ctr) [19:36:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:37:25] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (10Dzahn) a:03Dzahn [19:37:34] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (10Dzahn) p:05Medium→03High [19:40:03] (03PS3) 10Ryan Kemper: reimage: raid0.default_layout=2 for all installers [puppet] - 10https://gerrit.wikimedia.org/r/697832 (https://phabricator.wikimedia.org/T274788) [19:42:32] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh3001.wikimedia.org [19:42:34] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:23] mutante: <3 [19:45:44] (03PS1) 10Zabe: Avoid using MWNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697851 [19:46:39] (03PS1) 10Dzahn: site: add wikidough esams with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/697852 (https://phabricator.wikimedia.org/T283852) [19:47:09] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.756e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [19:50:34] (03PS1) 10Jdlrobson: Enable wgVectorConsolidateUserLinks on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697855 (https://phabricator.wikimedia.org/T266536) [19:52:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:53:42] (03CR) 10Ssingh: [C: 03+1] "I already had a patch for this https://gerrit.wikimedia.org/r/c/operations/puppet/+/696605/ but yours has a better description, so let's u" [puppet] - 10https://gerrit.wikimedia.org/r/697852 (https://phabricator.wikimedia.org/T283852) (owner: 10Dzahn) [19:54:06] (03Abandoned) 10Ssingh: site: add doh3001 and doh3002 with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/696605 (https://phabricator.wikimedia.org/T283852) (owner: 10Ssingh) [19:54:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:55:33] (03CR) 10Dzahn: [C: 03+2] site: add wikidough esams with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/697852 (https://phabricator.wikimedia.org/T283852) (owner: 10Dzahn) [20:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210602T2000). [20:01:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3001.wikimedia.org [20:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:23] (03PS1) 10Dzahn: DHCP: add doh3 to partman regex, add MAC for doh3001 [puppet] - 10https://gerrit.wikimedia.org/r/697858 (https://phabricator.wikimedia.org/T283852) [20:06:22] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add doh3 to partman regex, add MAC for doh3001 [puppet] - 10https://gerrit.wikimedia.org/r/697858 (https://phabricator.wikimedia.org/T283852) (owner: 10Dzahn) [20:06:59] (03CR) 10Dzahn: [V: 04-1] "13:06:03 dhcp configuration: NOT OK" [puppet] - 10https://gerrit.wikimedia.org/r/697858 (https://phabricator.wikimedia.org/T283852) (owner: 10Dzahn) [20:08:30] (03PS2) 10Dzahn: DHCP: add doh3 to partman regex, add MAC for doh3001 [puppet] - 10https://gerrit.wikimedia.org/r/697858 (https://phabricator.wikimedia.org/T283852) [20:21:24] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh3002.wikimedia.org [20:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/697832 (https://phabricator.wikimedia.org/T274788) (owner: 10Ryan Kemper) [20:24:50] (03CR) 10Ssingh: [C: 03+1] DHCP: add doh3 to partman regex, add MAC for doh3001 [puppet] - 10https://gerrit.wikimedia.org/r/697858 
(https://phabricator.wikimedia.org/T283852) (owner: 10Dzahn) [20:26:21] (03CR) 10Dzahn: [C: 03+2] DHCP: add doh3 to partman regex, add MAC for doh3001 [puppet] - 10https://gerrit.wikimedia.org/r/697858 (https://phabricator.wikimedia.org/T283852) (owner: 10Dzahn) [20:27:42] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh3002.wikimedia.org [20:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:59] sukhe: bad news. No valid allocation solutions, failure reasons: FailDisk: 6 [20:29:19] this means we dont have enough resources to fulfill the request for 2 VMs with that size [20:29:25] afaict [20:29:44] ah, so the disk size is a concern? we don't need 30G even, since we don't write anything to disk, other than the packages themselves [20:29:53] if there is a base disk size for our setups, that should be fine [20:30:39] [ganeti3001:~] $ sudo gnt-node list [20:30:39] Node DTotal DFree MTotal MNode MFree Pinst Sinst [20:30:42] ganeti3001.esams.wmnet 390.7G 16.9G 125.4G 15.7G 105.1G 2 3 [20:30:45] ganeti3002.esams.wmnet 390.7G 1.8G 125.4G 1.6G 119.6G 1 6 [20:30:48] ganeti3003.esams.wmnet 390.7G 285.0G 125.4G 29.3G 108.1G 5 0 [20:30:54] see that 1.8G in DFree .. that would be the problem it looks [20:32:01] ah, so I am guessing we would have to do 3003? [20:32:50] in the sense that are there other concerns preventing us from doing 3001 and 3003? [20:32:57] I think we need the space on all 3 nodes because it might migrate machines between nodes [20:33:53] oh hmm [20:34:04] https://wikitech.wikimedia.org/wiki/Ganeti#Verify_cluster_resource_availability [20:34:18] " In theory there should be sufficient disk/memory space on all nodes in the row that you are planning to use, otherwise you might get failures when creating the VM. 
" [20:34:29] matches the error in that case [20:34:30] I like the "In theory" part of this [20:34:33] ha [20:36:40] but it's also no like it's using them evenly, eh [20:37:09] you would think it does if it reserves the space on all of them.. this must be the "might" and "theory" part [20:37:17] trying to figure out from the docs on what is a solution but it seems like they only talk about adding a disk to a VM [20:38:09] yea, solutions I see are just.. find existing thing we can delete.. try to cut disk in half.. or make ticket to actually request hardware expansion [20:39:27] sukhe: comparing to other VMs in esams.. something like ping3001, ping offloader.. guess the disk size [20:39:31] 5G :) [20:40:04] eqsin looks pretty free, we are not tied to esams, just trying for something that's !US to try out the anycasted service (well it doesn't even have to non-US, just that having it outside will be "better" :) [20:40:12] actually it is 3GB and 1.9 G used, hah [20:40:51] ganeti5001.eqsin.wmnet 390.7G 141.9G 125.5G 33.6G 99.6G 6 0 [20:40:54] ganeti5002.eqsin.wmnet 390.7G 390.7G 125.5G 1.6G 119.7G 0 0 [20:40:57] ganeti5003.eqsin.wmnet 390.7G 141.9G 125.5G 1.3G 122.1G 0 6 [20:41:02] let me delete 3001, and recreate it with just 10GB [20:41:10] if you dont really need disk space [20:41:11] that also works, yeah [20:41:34] dont you ideally want esams AND eqsin? [20:41:38] no, I checked that existing wikidough roles are using ~2GB, +/- the size of the logs [20:41:55] mutante: yes, in the future for sure. it will be on all five [20:42:07] 16:41:38 < sukhe> no, I checked that existing wikidough roles are using ~2GB, +/- the size of the logs [20:42:15] s/no/so, big difference :D [20:42:24] ok, deleting 3001 with the decom script [20:42:48] this removes DNS and all.. which isnt really needed but better than messing with the workflow I guess [20:43:15] thanks for all the help. 
if it's too complicated, we can drop this for now, open the ticket for hardware expansion and just do eqsin [20:44:52] 309M /var/log/anycast-healthchecker/ [20:44:53] hehe [20:47:13] no worries, it's just running a cookbook [20:47:16] brb [20:49:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts doh3001.wikimedia.org [20:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:11] this is a weird special case for the decom stuff where a VM is removed before it was in puppet, worth testing.. looks ok though. a few warnings that can be ignored but finds VM and removes it [20:56:13] yeah we should probably audit the disk overallocations at all the ganetis [20:56:31] I'm guessing the only real space hog in the remote DCs is apt repos [20:56:56] I should have checked more closely before creating it but there is no harm done, it fails and tells you why [20:57:13] yeah but it's kind of a problem regardless of wikidough, for future planning [20:57:42] (if our apt repo is in there and it's so big that it's gotta take all the ganeti disk space we've got, probably on multiple nodes if we're allowing for migration) [20:57:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:58:09] *nod*, yes, needs capacity planning [20:58:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh3001.wikimedia.org [20:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:51] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `doh3001.wikimedia.org` - doh3001.wikimedia.org (**WARN**) -... 
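The `FailDisk` error and the `gnt-node list` output pasted at 20:30 can be checked mechanically. The sketch below parses that pasted table and reports which nodes have at least a requested amount in the `DFree` column. It is only an illustration: `nodes_with_room` and `parse_size_g` are hypothetical helpers, the sizes are taken as printed (only the `G` suffix appears in the paste), and the 30G and 10G request sizes are the ones mentioned in the conversation.

```python
from typing import List

# Table as pasted in the channel from `sudo gnt-node list` on ganeti3001.
GNT_NODE_LIST = """\
Node DTotal DFree MTotal MNode MFree Pinst Sinst
ganeti3001.esams.wmnet 390.7G 16.9G 125.4G 15.7G 105.1G 2 3
ganeti3002.esams.wmnet 390.7G 1.8G 125.4G 1.6G 119.6G 1 6
ganeti3003.esams.wmnet 390.7G 285.0G 125.4G 29.3G 108.1G 5 0
"""


def parse_size_g(text: str) -> float:
    """Convert a size such as '16.9G' to a float; other suffixes are rejected."""
    if not text.endswith("G"):
        raise ValueError(f"unexpected size format: {text!r}")
    return float(text[:-1])


def nodes_with_room(listing: str, needed_g: float) -> List[str]:
    """Return the node names whose DFree column is at least needed_g."""
    lines = listing.strip().splitlines()
    dfree_col = lines[0].split().index("DFree")
    fits = []
    for line in lines[1:]:
        fields = line.split()
        if parse_size_g(fields[dfree_col]) >= needed_g:
            fits.append(fields[0])
    return fits


for size in (30, 10):
    print(f"{size}G fits on: {nodes_with_room(GNT_NODE_LIST, size)}")
```

On the pasted numbers, a 30G disk fits only on ganeti3003, while a 10G disk fits on ganeti3001 and ganeti3003. How many nodes Ganeti actually needs for a placement depends on the instance's disk template and allocator, which the channel does not spell out, so treat this as a rough pre-check only.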
[20:58:55] if (and I'm out on a limb with guessing games here) the only serious space user is apt, we could also opt to just have slower installs and get rid of the dc-local apt in the edges) [20:59:25] (until such a time as they're re-deployed with enough disk for it not to matter or whatever) [20:59:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:59:34] we did not copy the APT data to each POP [20:59:40] oh [20:59:47] I wonder what the space hog is then [20:59:48] just TFTP with firmware etc [20:59:56] (03CR) 10Ryan Kemper: [C: 03+2] reimage: raid0.default_layout=2 for all installers [puppet] - 10https://gerrit.wikimedia.org/r/697832 (https://phabricator.wikimedia.org/T274788) (owner: 10Ryan Kemper) [20:59:57] ehmm.. yea [21:00:00] let's see [21:01:01] bast - 40G - seems a waste [21:01:37] it's probably because prometheus used to be on that and now is dedicated [21:01:40] 10SRE, 10Wikimedia-Mailing-lists, 10Security, 10Upstream: Implement proper AAA for lists.wikimedia.org (mailman) - https://phabricator.wikimedia.org/T118641 (10Multichill) >>! In T118641#7127096, @Ladsgroup wrote: > Tentatively calling this done given that we now have mailman3, reopen if you think mailman3... [21:01:45] install - 20G [21:02:16] prometheus: 278G :O [21:02:20] bblack: prometheus is [21:02:35] - disk/0: drbd, size 128.0G [21:02:41] - disk/1: drbd, size 150.0G [21:03:29] ncredir, netflow - 20 G each [21:03:41] ping - 5G .. that's it [21:04:02] prometheus stole our resources :) [21:04:02] yeah [21:04:17] I just finally put the command together myself [21:04:18] prometheus3001.esams.wmnet 131072,153600 [21:04:29] gnt-instance list -o name,disk.sizes [21:04:47] fun! [21:05:56] apparently it actually has ~202G in use (96G of that in the rootfs....) 
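The "278G" figure for prometheus follows directly from the `gnt-instance list -o name,disk.sizes` line quoted above. A minimal sketch of the arithmetic, assuming the `disk.sizes` values are in MiB (which matches the 128.0G and 150.0G disks reported in the channel); `total_disk_g` is a hypothetical helper name.

```python
from typing import Tuple


def total_disk_g(line: str) -> Tuple[str, float]:
    """Sum the comma-separated disk sizes (assumed MiB) and return (name, total in G)."""
    name, sizes = line.split()
    total_mib = sum(int(size) for size in sizes.split(","))
    return name, total_mib / 1024


# Line as pasted in the channel.
name, total = total_disk_g("prometheus3001.esams.wmnet 131072,153600")
print(f"{name}: {total:.0f}G")
```

131072 MiB is 128G and 153600 MiB is 150G, so the two DRBD disks add up to the 278G mentioned at 21:02.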
[21:08:21] anyways, this is all news to me, the disk needs of prometheus in the edges [21:08:30] https://phabricator.wikimedia.org/T243057 [21:08:36] "Move Prometheus off eqsin/ulsfo/esams bastions" [21:08:48] once this is resolved .. the bastions dont need 40G either [21:09:26] we moved the bastions into ganeti as well, right? [21:09:39] not sure I see an actual request for the VMs [21:09:42] yes, we did [21:09:59] one option is we could perhaps re-purpose the old bare-metal bastions (which are still everywhere) to be hardware prometheus nodes [21:10:11] since the disk size doesn't make it a very VM-friendly thing anyways [21:10:30] I donno, it's something that will have to be the subject of a broader conversation between teams about all the constraints [21:10:30] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [21:10:33] !log T280382 T281437 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs2007.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage` [21:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:36] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:40] T281437: hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 [21:11:09] yes, if it turns out prometheus actually needs all that data and it's not a matter of configuring retention time or something [21:11:23] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [21:11:29] what you said, needs multi teams [21:12:01] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh3001.wikimedia.org [21:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:16] recreating doh3001 with 10G [21:13:21] RECOVERY - Host wdqs2007 is UP: PING OK - 
Packet loss = 0%, RTA = 31.63 ms [21:13:27] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 6.474e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:13:29] bblack: also the part that there are 2 virtual disks probably means that it grew larger than originally expected and then one was added [21:17:07] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: connect to address 10.192.16.156 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:17:19] !log `ryankemper@wdqs1013:~$ sudo depool` (catching up on 17.9h lag) [21:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:21] !log ryankemper@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs2007.codfw.wmnet [21:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3001.wikimedia.org [21:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:07] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh3002.wikimedia.org [21:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:00] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) [21:30:11] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE [21:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:11] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10Dzahn) When we recreate the bastions without prometheus, we don't need 
to use 40GB disk anymore, right? Can we make them considerably smaller? Because today we actually r... [21:32:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE [21:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:15] bblack: https://phabricator.wikimedia.org/T277163 [21:37:38] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1004.eqiad.wmnet with reason: REIMAGE [21:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3002.wikimedia.org [21:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:22] (03PS1) 10Dzahn: DHCP: update MAC address of doh3001 after recreation [puppet] - 10https://gerrit.wikimedia.org/r/697866 (https://phabricator.wikimedia.org/T283852) [21:39:48] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1004.eqiad.wmnet with reason: REIMAGE [21:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:43] 10SRE, 10observability, 10serviceops-radar, 10User-fgiunchedi: Prometheus PoPs disk space utilization - https://phabricator.wikimedia.org/T277163 (10Dzahn) [21:55:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:01] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `wdqs_reimage` [21:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:05] T280382: WDQS hosts low on /srv disk space - 
https://phabricator.wikimedia.org/T280382
[21:59:53] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[21:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:58] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[22:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:27] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:22] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:14] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:06:37] SRE, SRE-Access-Requests: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (colewhite) p:Triage→Medium
[22:07:17] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (colewhite)
[22:07:24] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs2007.codfw.wmnet
[22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:33] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1004.eqiad.wmnet
[22:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:51] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[22:07:51] (CR) Cwhite: [C: +2] admin: add cmadeo to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/697682 (https://phabricator.wikimedia.org/T284109) (owner: Cwhite)
[22:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:58] (PS3) Cwhite: admin: add cmadeo to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/697682 (https://phabricator.wikimedia.org/T284109)
[22:08:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:47] SRE, Analytics, Analytics-Kanban, Product-Analytics, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (colewhite)
[22:11:00] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1003.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `wdqs_reimage_2`
[22:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:04] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[22:11:29] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1003.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `wdqs_reimage_2`
[22:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:40] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (colewhite) Open→Resolved The group membership change has been deployed. Please feel free to reopen if you encounter any related issue.
[22:17:13] (PS2) Cwhite: admin: add htriedman to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/697677 (https://phabricator.wikimedia.org/T283368)
[22:17:25] (PS2) Dzahn: DHCP: update MAC address of doh3001, add doh3002 [puppet] - https://gerrit.wikimedia.org/r/697866 (https://phabricator.wikimedia.org/T283852)
[22:17:56] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (colewhite)
[22:18:08] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (colewhite)
[22:19:53] !log setting charset of all tables in wikitech to binary (T284108 T269348)
[22:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:58] T269348: wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348
[22:19:59] T284108: Bring labswiki database tables up to date - https://phabricator.wikimedia.org/T284108
[22:20:14] (CR) Dzahn: [C: +2] DHCP: update MAC address of doh3001, add doh3002 [puppet] - https://gerrit.wikimedia.org/r/697866 (https://phabricator.wikimedia.org/T283852) (owner: Dzahn)
[22:21:42] (CR) Cwhite: [C: +2] admin: add htriedman to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/697677 (https://phabricator.wikimedia.org/T283368) (owner: Cwhite)
[22:23:04] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (colewhite) Open→Resolved a:Htriedman→colewhite The group membership change has been deployed. Please feel f...
[22:24:58] (PS2) Cwhite: admin: add cooltey to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/697679 (https://phabricator.wikimedia.org/T283189)
[22:25:27] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1003.eqiad.wmnet with reason: REIMAGE
[22:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:06] (CR) Cwhite: [C: +2] admin: add cooltey to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/697679 (https://phabricator.wikimedia.org/T283189) (owner: Cwhite)
[22:27:24] SRE, SRE-Access-Requests, Patch-For-Review: Superset Access for Cooltey Feng - https://phabricator.wikimedia.org/T283189 (colewhite) Open→Resolved The group membership change has been deployed. Please feel free to reopen if you encounter any related issue.
[22:27:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1003.eqiad.wmnet with reason: REIMAGE
[22:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:31] !log T280382 Cleaned up no-longer-needed files removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/697832 => `ryankemper@cumin1001:~$ sudo -E cumin -b 6 'P{install*}' 'sudo rm -fv /srv/tftpboot/buster-raid0-installer/pxelinux.cfg'`
[22:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:36] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[22:34:23] !log T280382 Cleaned up no-longer-needed files removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/697832 => `ryankemper@cumin1001:~$ sudo -E cumin -b 2 'P{apt*}' 'sudo rm -rfv /srv/tftpboot/buster-raid0-installer/pxelinux.cfg'`
[22:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:39] SRE, LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136
(colewhite) p:Triage→Medium a:colewhite
[22:35:21] SRE, Wikimedia-Mailing-lists: Mailman3 apparently disregarding nonmember addresses set to accept - https://phabricator.wikimedia.org/T284182 (colewhite) p:Triage→Medium
[22:38:12] (PS2) Cwhite: admin: add kgordon to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/697684 (https://phabricator.wikimedia.org/T283057)
[22:39:20] (CR) Cwhite: [C: +2] admin: add kgordon to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/697684 (https://phabricator.wikimedia.org/T283057) (owner: Cwhite)
[22:39:55] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs2003.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[22:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:59] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[22:40:39] SRE, Analytics, LDAP-Access-Requests, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (colewhite)
[22:41:15] PROBLEM - Host wdqs2003 is DOWN: PING CRITICAL - Packet loss = 100%
[22:41:19] (PS1) Ladsgroup: Reduce message parse in GadgetHooks::getPreferences (second time) [extensions/Gadgets] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697816 (https://phabricator.wikimedia.org/T58633)
[22:41:29] (PS2) Ladsgroup: Reduce message parse in GadgetHooks::getPreferences (second time) [extensions/Gadgets] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697816 (https://phabricator.wikimedia.org/T58633)
[22:41:44] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (Dzahn) We ran out of resources on esams ganeti (disk). Creating machines with just 10G instead after discussion.
[22:41:49] RECOVERY - Host wdqs2003 is UP: PING WARNING - Packet loss = 50%, RTA = 31.57 ms
[22:42:04] (CR) Cwhite: [C: +1] alerts: reload prometheus instances after deploy [puppet] - https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[22:42:13] (PS1) Ladsgroup: Allow html form field option 'options-messages' to get parsed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697817 (https://phabricator.wikimedia.org/T58633)
[22:42:19] (CR) jerkins-bot: [V: -1] Reduce message parse in GadgetHooks::getPreferences (second time) [extensions/Gadgets] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697816 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[22:42:22] SRE, Analytics, LDAP-Access-Requests, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (colewhite) Open→Resolved a:colewhite The group membership change has been deployed. Please feel free to reopen if you encounter any relate...
[22:43:20] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[22:44:19] (CR) Cwhite: [C: +1] alertmanager: attach runbook/dashboard URLs to IRC messages [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[22:44:35] (CR) Ladsgroup: [C: +2] Enable wgVectorConsolidateUserLinks on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/697855 (https://phabricator.wikimedia.org/T266536) (owner: Jdlrobson)
[22:44:39] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: connect to address 10.192.0.29 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:44:55] (CR) Ladsgroup: [C: +2] Allow html form field option 'options-messages' to get parsed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697817 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[22:45:26] (Merged) jenkins-bot: Enable wgVectorConsolidateUserLinks on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/697855 (https://phabricator.wikimedia.org/T266536) (owner: Jdlrobson)
[22:47:55] (CR) Ladsgroup: "recheck" [extensions/Gadgets] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697816 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[22:48:05] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:697855|Enable wgVectorConsolidateUserLinks on the beta cluster (T266536)]] (duration: 00m 57s)
[22:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:09] T266536: [EPIC] Consolidate user links into a single menu - https://phabricator.wikimedia.org/T266536
[22:48:29] Jdlrobson: deployed, this is noop in production, it can go at any time
[22:54:44] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for
2:00:00 on wdqs2003.codfw.wmnet with reason: REIMAGE
[22:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2003.codfw.wmnet with reason: REIMAGE
[22:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:49] Amir1: oh thanks!
[22:58:50] awesome
[22:59:05] Amir1: will be looking at your MobileFrontend patch on thur/fri
[22:59:41] Thanks
[23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210602T2300)
[23:00:04] Jdlrobson and Amir1: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:11] lol
[23:00:14] I can self-serve
[23:03:21] w00t
[23:05:11] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (Dzahn) VMs have been created. doh3001 and doh3002.wikimedia.org are up and running with "insetup". Feel free to use them now. They have about 1.8GB used...
[23:06:41] (Merged) jenkins-bot: Allow html form field option 'options-messages' to get parsed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697817 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[23:06:54] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (Dzahn) ` [doh3001:~] $ df -h Filesystem Size Used Avail Use% Mounted on .. /dev/vda1 8.9G 1.6G 6.8G 19% / [doh3002:~] $ df -h Filesystem...
[23:10:36] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (Dzahn) Open→Resolved
[23:11:11] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (Dzahn)
[23:12:13] (CR) Ladsgroup: [C: +2] Reduce message parse in GadgetHooks::getPreferences (second time) [extensions/Gadgets] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697816 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[23:16:26] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:18:46] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.7/includes: Backport: [[gerrit:697817|Allow html form field option 'options-messages' to get parsed (T58633)]] (duration: 01m 01s)
[23:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:51] T58633: Preference retrieval should not require so much parsing - https://phabricator.wikimedia.org/T58633
[23:24:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:42] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 4430 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[23:26:00] !log T280382 `wdqs2007.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred.
`df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv`
[23:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:04] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[23:28:19] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:30] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[23:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:24] SRE, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (JAnstee_WMF) #operations. Having trouble gaining access despite approved production access. Verbose output: Last login: Wed Jun 2 16:14:12 on ttys000 janstee...
[23:33:35] (Merged) jenkins-bot: Reduce message parse in GadgetHooks::getPreferences (second time) [extensions/Gadgets] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697816 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[23:38:43] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:47] !log ladsgroup@deploy1002 scap failed: average error rate on 4/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details)
[23:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:24] it worked fine in mwdebug :/
[23:47:26] !log T280382 `wdqs1004.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.9T 998G 1.8T 36% /srv`
[23:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:30] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[23:49:34] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3600 ge (W)1200 ge 1001 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[23:51:56] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper) Re-image completed successfully, and I went ahead and set the host to `Active` in Netbox. This is done.
[23:53:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:23] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[23:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:42] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1003.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[23:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:46] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[23:57:03] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:23] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[23:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log