[00:03:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:04:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:13:32] (PS1) Bstorm: lighttpd surprises: we don't get python2 in php images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/708187 (https://phabricator.wikimedia.org/T287421)
[01:03:09] (PS1) Sharvaniharan: Remove IR schema config [mediawiki-config] - https://gerrit.wikimedia.org/r/708188
[01:12:51] PROBLEM - dump of s6 in eqiad on alert1001 is CRITICAL: Last dump for s6 at eqiad (db1140.eqiad.wmnet:3316) taken on 2021-07-27 00:00:01 is 109 GB, but previous one was 92 GB, a change of 18.3% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:20:06] (CR) Labdajiwa: Change Javanese Wiktionary logo (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/708065 (https://phabricator.wikimedia.org/T287425) (owner: Labdajiwa)
[02:00:05] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T0200)
[02:07:10] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.16 [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708196
[02:07:12] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.16 [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708196 (owner: TrainBranchBot)
[02:19:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:28] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.16 [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708196 (owner: TrainBranchBot)
[02:44:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:59] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:41:12] (CR) Ladsgroup: [C: -1] analytics: Migrate clean_jupyter_user_local_trash to systemd timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/708183 (https://phabricator.wikimedia.org/T286442) (owner: Legoktm)
[04:56:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:01:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:11:51] (PS1) Marostegui: db1162: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/708202 (https://phabricator.wikimedia.org/T287230)
[05:12:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1162 T287230', diff saved to https://phabricator.wikimedia.org/P16899 and previous config saved to /var/cache/conftool/dbconfig/20210727-051212-marostegui.json
[05:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:21] T287230: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230
[05:12:32] (CR) Marostegui: [C: +2] db1162: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/708202
(https://phabricator.wikimedia.org/T287230) (owner: Marostegui)
[05:13:50] (PS1) Marostegui: install_server: Reimage db1162 to Buster [puppet] - https://gerrit.wikimedia.org/r/708203 (https://phabricator.wikimedia.org/T287230)
[05:15:52] (CR) Marostegui: [C: +2] install_server: Reimage db1162 to Buster [puppet] - https://gerrit.wikimedia.org/r/708203 (https://phabricator.wikimedia.org/T287230) (owner: Marostegui)
[05:26:49] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:31:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:32:12] (PS1) Ladsgroup: Enable request language for RDF stubs in testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/708204 (https://phabricator.wikimedia.org/T285795)
[05:32:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
[05:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
[05:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:45:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 75 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:50:22] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:05:40] (PS2) Labdajiwa: Set the project namespace and sitename for Javanese Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/708206 (https://phabricator.wikimedia.org/T287437)
[06:08:28] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Krassotkin) @Ladsgroup I'm sorry but I'm not trying to be loud. I'm just try...
[06:10:58] (CR) Ladsgroup: [C: +2] "deploying now" [mediawiki-config] - https://gerrit.wikimedia.org/r/708204 (https://phabricator.wikimedia.org/T285795) (owner: Ladsgroup)
[06:11:40] (Merged) jenkins-bot: Enable request language for RDF stubs in testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/708204 (https://phabricator.wikimedia.org/T285795) (owner: Ladsgroup)
[06:18:28] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:708204|Enable request language for RDF stubs in testwikidatawiki (T285795)]], Part I (duration: 00m 57s)
[06:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:37] T285795: Limit languages on EntityStub rdf builders - https://phabricator.wikimedia.org/T285795
[06:20:44] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:708204|Enable request language for RDF stubs in testwikidatawiki (T285795)]], Part II (duration: 00m 56s)
[06:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:49] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[06:24:23] (PS1) Marostegui: install_server: Reimage db1129 to buster [puppet] - https://gerrit.wikimedia.org/r/708239 (https://phabricator.wikimedia.org/T287230)
[06:25:27] (CR) Marostegui: [C: +2] install_server:
Reimage db1129 to buster [puppet] - https://gerrit.wikimedia.org/r/708239 (https://phabricator.wikimedia.org/T287230) (owner: Marostegui)
[06:29:12] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (RKemper)
[06:30:42] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Firestar464) >>! In T287362#7236128, @Krassotkin wrote: > @Aklapper Pleas...
[06:33:07] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Firestar464) >>! In T287362#7238579, @Krassotkin wrote: > @Ladsgroup I'm sor...
[06:41:46] SRE, DynamicPageList (Wikimedia), Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Krassotkin) We cannot implement the functionality of news feeds through bots. Wikinews contains news feeds both on the home page and in all categories. Any...
[06:41:49] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[06:46:38] (CR) Filippo Giunchedi: [C: +2] hieradata: point cumin_masters in pontoon to cloudinfra hosts [puppet] - https://gerrit.wikimedia.org/r/708042 (https://phabricator.wikimedia.org/T287269) (owner: Filippo Giunchedi)
[06:50:18] !log install iptables from buster-backports (manually) on ml-serve-ctrl200[1,2] as test (+ reboot the nodes for a clean start) - T287238
[06:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:25] T287238: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238
[06:52:51] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Krassotkin) @Firestar464 The project does not seem to give an error but it i...
[06:59:03] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:59:27] (CR) Filippo Giunchedi: [C: +2] profile: wait for apache2 in wmcs::instance sites-local [puppet] - https://gerrit.wikimedia.org/r/707297 (https://phabricator.wikimedia.org/T283531) (owner: Filippo Giunchedi)
[07:11:49] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:14:01] !log installing krb security updates on buster
[07:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:53] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:21:49] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:23:40] (PS3) Labdajiwa: Set the project namespace and sitename for Javanese Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/708206 (https://phabricator.wikimedia.org/T287437)
[07:24:32] SRE, DynamicPageList (Wikimedia), Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Bawolff) > On the other hand, in the last discussion, it was decided to rewrite the DPL to CirrusSearch. Decided is a strong word. It was suggested as a poss...
[07:38:22] SRE, Data-Persistence-Backup, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) We plan to work on this today. Sadly, for some reason, phabricator didn't send me any email about this issue until the end of my day yesterday, so I had to ge...
[07:45:18] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (elukey)
[07:46:21] I need to do some maintenance on mwmaint1002
[07:46:28] you probably shouldn't use that anyway, but if you have to, please ping me beforehand
[07:46:44] that's what the server is meant for anyways? :D
[07:48:45] well, I should specify I am going to "break it" for a while
[07:48:48] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[07:48:59] (stop puppet and other stuff)
[07:49:35] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[07:50:28] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/708130 (owner: Jbond)
[07:52:29] !log disabling puppet on mwmaint1002
[07:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:55] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:50] (PS1) Filippo Giunchedi: pontoon: allow access to cloud-cumin-* hosts as cumin_masters [puppet] - https://gerrit.wikimedia.org/r/708243 (https://phabricator.wikimedia.org/T287269)
[08:04:02] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Joe) Just to clarify, given @Ladsgroup has been the target of severe on-wiki...
[08:06:15] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/708106 (owner: Filippo Giunchedi)
[08:07:56] SRE, Datacenter-Switchover: Add step to rsync home dirs on mwmaint hosts before DC switchover - https://phabricator.wikimedia.org/T287303 (Volans) >>! In T287303#7238116, @Legoktm wrote: > While I agree with you, that's not the current reality. In any case it's not just scripts, it's also log files, some...
[08:09:39] (CR) Zabe: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/708206 (https://phabricator.wikimedia.org/T287437) (owner: Labdajiwa)
[08:11:25] SRE, Data-Persistence-Backup, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) @TJones I've restored your old home folder onto mwmaint1002:/home/tjones/backup-restore-2021-07-13--05-05-51 It should have the same access permissions as th...
[08:12:06] (PS1) Volans: Update to v2.10.4-wmf5 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/708244
[08:13:22] (CR) David Caro: [C: +2] ceph: Add CephOSDFlag object [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/705872 (owner: David Caro)
[08:13:25] (CR) David Caro: [C: +2] ceph: Add CephStatus tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/705873 (owner: David Caro)
[08:13:28] (CR) David Caro: [C: +2] ceph: fix typo Satus->Status [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/705874 (owner: David Caro)
[08:13:31] (CR) David Caro: [C: +2] ceph: Added CephClusterController tests and a couple fixes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/706343 (owner: David Caro)
[08:13:35] (CR) David Caro: [C: +2] ceph: fix a typo [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/708078 (owner: David Caro)
[08:13:37] (CR) David Caro: [C: +2] ceph: Added tests to CephOSDController [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/708079 (owner:
David Caro)
[08:16:04] (Merged) jenkins-bot: ceph: Add CephOSDFlag object [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/705872 (owner: David Caro)
[08:16:20] (CR) David Caro: [C: +2] ceph: add latency monitoring stats [puppet] - https://gerrit.wikimedia.org/r/700182 (https://phabricator.wikimedia.org/T281254) (owner: David Caro)
[08:16:38] (CR) David Caro: [C: +2] ceph: add latency monitoring stats (1 comment) [puppet] - https://gerrit.wikimedia.org/r/700182 (https://phabricator.wikimedia.org/T281254) (owner: David Caro)
[08:16:46] (CR) Zabe: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/708065 (https://phabricator.wikimedia.org/T287425) (owner: Labdajiwa)
[08:16:57] (Merged) jenkins-bot: ceph: Add CephStatus tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/705873 (owner: David Caro)
[08:16:59] (Merged) jenkins-bot: ceph: fix typo Satus->Status [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/705874 (owner: David Caro)
[08:17:01] (Merged) jenkins-bot: ceph: Added CephClusterController tests and a couple fixes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/706343 (owner: David Caro)
[08:17:03] (Merged) jenkins-bot: ceph: fix a typo [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/708078 (owner: David Caro)
[08:17:42] (Merged) jenkins-bot: ceph: Added tests to CephOSDController [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/708079 (owner: David Caro)
[08:20:38] (PS1) David Caro: prometheus.node-pinger: fix script source path [puppet] - https://gerrit.wikimedia.org/r/708246
[08:21:35] (CR) David Caro: [C: +2] prometheus.node-pinger: fix script source path [puppet] - https://gerrit.wikimedia.org/r/708246 (owner: David Caro)
[08:27:03] (CR) Volans: [V: +2 C: +2] Update to v2.10.4-wmf5 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/708244 (owner: Volans)
[08:28:21] !log marostegui@cumin1001 dbctl commit
(dc=all): 'Depool db2147 to restart mysql', diff saved to https://phabricator.wikimedia.org/P16900 and previous config saved to /var/cache/conftool/dbconfig/20210727-082820-marostegui.json
[08:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:51] !log volans@deploy1002 Started deploy [netbox/deploy@660ad14]: Test v2.10.4-wmf5 on netbox-next
[08:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:25] RECOVERY - MariaDB memory on db2147 is OK: OK Memory 74% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:29:52] !log volans@deploy1002 Finished deploy [netbox/deploy@660ad14]: Test v2.10.4-wmf5 on netbox-next (duration: 01m 01s)
[08:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:28] (CR) Lucas Werkmeister (WMDE): Disable mobile contributions simplifications on Wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: Jdlrobson)
[08:31:49] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01047 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:32:10] SRE, MW-on-K8s, serviceops, Patch-For-Review, User-jijiki: The mediawiki-webserver image should only log in json format - https://phabricator.wikimedia.org/T285384 (Joe) Open→Resolved
[08:33:58] (PS1) David Caro: ceph: use the correct var (osd::hosts) for the lookup [puppet] - https://gerrit.wikimedia.org/r/708247
[08:35:43] (CR) David Caro: [C: +2] ceph: use the correct var (osd::hosts) for the lookup [puppet] - https://gerrit.wikimedia.org/r/708247 (owner: David Caro)
[08:35:45] (PS2) Giuseppe Lavagetto: service::catalog: lower the depool threshold for api [puppet] - https://gerrit.wikimedia.org/r/708072
[08:35:47] (PS1) Giuseppe Lavagetto:
profile::trafficserver: fix mwdebug on k8s support [puppet] - https://gerrit.wikimedia.org/r/708248 (https://phabricator.wikimedia.org/T283056)
[08:36:09] !log reenabled puppet on mwmaint1002
[08:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:34] (CR) Giuseppe Lavagetto: [C: +2] service::catalog: lower the depool threshold for api [puppet] - https://gerrit.wikimedia.org/r/708072 (owner: Giuseppe Lavagetto)
[08:37:37] (CR) jerkins-bot: [V: -1] profile::trafficserver: fix mwdebug on k8s support [puppet] - https://gerrit.wikimedia.org/r/708248 (https://phabricator.wikimedia.org/T283056) (owner: Giuseppe Lavagetto)
[08:38:34] (PS1) Filippo Giunchedi: thanos: fix rule port [puppet] - https://gerrit.wikimedia.org/r/708249 (https://phabricator.wikimedia.org/T287142)
[08:39:51] (CR) Muehlenhoff: Class API: add on_error() method (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/705720 (owner: Volans)
[08:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 5%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16901 and previous config saved to /var/cache/conftool/dbconfig/20210727-084031-root.json
[08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:02] (CR) Volans: Class API: add on_error() method (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/705720 (owner: Volans)
[08:50:57] (CR) Jelto: [C: +2] move gitlab rails exporter to port 8083 [puppet] - https://gerrit.wikimedia.org/r/707859 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[08:51:01] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:51:38] (PS2) Jelto: move gitlab rails exporter to port 8083 [puppet] - https://gerrit.wikimedia.org/r/707859 (https://phabricator.wikimedia.org/T275170)
[08:55:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 10%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16902 and previous config saved to /var/cache/conftool/dbconfig/20210727-085535-root.json
[08:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:38] <_joe_> !log restart pybal on lvs2010 to pick up the depool threshold change
[08:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:39] <_joe_> !log repooling mw225[12] for apis
[08:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:47] (CR) Vgutierrez: [C: +1] profile::trafficserver: fix mwdebug on k8s support [puppet] - https://gerrit.wikimedia.org/r/708248 (https://phabricator.wikimedia.org/T283056) (owner: Giuseppe Lavagetto)
[09:00:13] (PS2) Giuseppe Lavagetto: profile::trafficserver: fix mwdebug on k8s support [puppet] - https://gerrit.wikimedia.org/r/708248 (https://phabricator.wikimedia.org/T283056)
[09:02:22] (CR) Filippo Giunchedi: [C: +2] thanos: fix rule port [puppet] - https://gerrit.wikimedia.org/r/708249 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[09:03:43] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[09:04:11] <_joe_> !log restarting pybal on lvs2009 to pick up the new api depool threshold
[09:04:16] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) -
https://phabricator.wikimedia.org/T280203 (Dzahn)
[09:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:19] (CR) Vgutierrez: [C: +1] profile::trafficserver: fix mwdebug on k8s support [puppet] - https://gerrit.wikimedia.org/r/708248 (https://phabricator.wikimedia.org/T283056) (owner: Giuseppe Lavagetto)
[09:06:52] (PS1) Dzahn: scap: replace scap proxy mw1285 with mw1306 [puppet] - https://gerrit.wikimedia.org/r/708252
[09:06:55] jouncebot: next
[09:06:55] In 1 hour(s) and 53 minute(s): European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1100)
[09:07:29] (PS2) Dzahn: scap: replace scap proxy mw1285 with mw1306 [puppet] - https://gerrit.wikimedia.org/r/708252
[09:08:57] (CR) Dzahn: [C: +2] scap: replace scap proxy mw1285 with mw1306 [puppet] - https://gerrit.wikimedia.org/r/708252 (owner: Dzahn)
[09:09:30] (CR) Giuseppe Lavagetto: [C: +2] profile::trafficserver: fix mwdebug on k8s support [puppet] - https://gerrit.wikimedia.org/r/708248 (https://phabricator.wikimedia.org/T283056) (owner: Giuseppe Lavagetto)
[09:10:24] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Michael) Is there an update on this? Anything we (WMDE) can do to help this move forward?
[09:10:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 15%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16904 and previous config saved to /var/cache/conftool/dbconfig/20210727-091038-root.json
[09:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:06] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup) I'm on it
[09:11:51] (PS1) Dzahn: site/conftool/DHCP: decom mw1285 [puppet] - https://gerrit.wikimedia.org/r/708253 (https://phabricator.wikimedia.org/T280203)
[09:12:30] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1285.eqiad.wmnet
[09:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1285.eqiad.wmnet
[09:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:34] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[09:16:51] (CR) Jbond: [V: +1 C: +2] wmflib::dir::mkdir_p: Use ensure_resource for full directory [puppet] - https://gerrit.wikimedia.org/r/708130 (owner: Jbond)
[09:17:34] (CR) Jbond: [C: +2] Move systemd presets to /run [puppet] - https://gerrit.wikimedia.org/r/708121 (owner: Muehlenhoff)
[09:22:38] (PS1) Dzahn: site/conftool/scap: decom mw1269, replace scap proxy with mw1420 [puppet] - https://gerrit.wikimedia.org/r/708254 (https://phabricator.wikimedia.org/T280203)
[09:24:04] SRE, MW-on-K8s, serviceops, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes -
https://phabricator.wikimedia.org/T283056 (Joe)
[09:24:51] (PS1) Giuseppe Lavagetto: Add the experimental kubernetes backend to mwdebug [mediawiki-config] - https://gerrit.wikimedia.org/r/708255 (https://phabricator.wikimedia.org/T283056)
[09:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16905 and previous config saved to /var/cache/conftool/dbconfig/20210727-092542-root.json
[09:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:13] (PS1) Jbond: debian::autostart: force the creation of the symlink [puppet] - https://gerrit.wikimedia.org/r/708256
[09:27:54] (CR) Jbond: [C: +2] debian::autostart: force the creation of the symlink [puppet] - https://gerrit.wikimedia.org/r/708256 (owner: Jbond)
[09:32:14] SRE, DynamicPageList (Wikimedia), Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Bawolff) >I'd argue we have to disable DPL from everywhere, this has potential to cause a full outage in our system but from any wiki that has it turned on. I...
[09:34:37] (PS1) Dzahn: hiera/appservers: remove mcrouter proxy values for eqiad [puppet] - https://gerrit.wikimedia.org/r/708257 (https://phabricator.wikimedia.org/T280203)
[09:36:03] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Krassotkin) @Joe You are the third person to tell me about @Ladsgroup's hara...
[09:37:46] (PS1) Elukey: WIP profile::kubernetes::node: deploy iptables from buster-backports [puppet] - https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238)
[09:38:16] (PS1) Jcrespo: query killer: Emergency patch for s3 [software] - https://gerrit.wikimedia.org/r/708259
[09:40:00] (CR) Dzahn: "Effie, Joe, can these be removed or is this too early?" [puppet] - https://gerrit.wikimedia.org/r/708257 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[09:40:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16906 and previous config saved to /var/cache/conftool/dbconfig/20210727-094046-root.json
[09:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:57] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:41:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1285.eqiad.wmnet
[09:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:32] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1285.eqiad.wmnet` - m...
[09:41:56] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:48:36] (Abandoned) Jcrespo: query killer: Emergency patch for s3 [software] - https://gerrit.wikimedia.org/r/708259 (owner: Jcrespo)
[09:50:13] (CR) Dzahn: [C: +2] site/conftool/DHCP: decom mw1285 [puppet] - https://gerrit.wikimedia.org/r/708253 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[09:50:34] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (elukey)
[09:51:19] (PS5) H.krishna123: api_db: Add code to enable database connection, add API to obtain recent backups data and freshness, load config from alerting yaml file, add tests [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142)
[09:52:49] !log reverting query killer parameters on s3 codfw replicas
[09:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:04] jouncebot: now
[09:54:04] No deployments scheduled for the next 1 hour(s) and 5 minute(s)
[09:54:19] PROBLEM - Check systemd state on ml-serve-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:21] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:54:50] (PS1) Filippo Giunchedi: pontoon: set thanos retention to one week [puppet] - https://gerrit.wikimedia.org/r/708262
[09:55:50] (CR) Dzahn: [C: +2] site/conftool/scap: decom mw1269, replace scap proxy with mw1420 [puppet] - https://gerrit.wikimedia.org/r/708254 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[09:55:50] !log
marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16908 and previous config saved to /var/cache/conftool/dbconfig/20210727-095549-root.json [09:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:57] (03PS2) 10Dzahn: site/conftool/scap: decom mw1269, replace scap proxy with mw1420 [puppet] - 10https://gerrit.wikimedia.org/r/708254 (https://phabricator.wikimedia.org/T280203) [09:56:07] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set thanos retention to one week [puppet] - 10https://gerrit.wikimedia.org/r/708262 (owner: 10Filippo Giunchedi) [09:56:15] RECOVERY - Check systemd state on ml-serve-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:51] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve-ctrl2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:58:47] ah! [10:02:54] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Firestar464) Can we stop whining about the DPL and get back to improving Wik... 
[10:05:22] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1269.eqiad.wmnet [10:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:52] !log running gitlab-ansible playbook on gitlab2001.wikimedia.org [10:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:11] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) Thanks for the info and the numbers, I will definitely take a deep look at it and see what we can do. One thing I want to add to see where I'm comi... [10:09:09] (03CR) 10Majavah: [C: 03+1] "While this is probably the best way to fix the breakage my Python 3 porting patch caused, let's not include this for Bullseye containers." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/708187 (https://phabricator.wikimedia.org/T287421) (owner: 10Bstorm) [10:10:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1269.eqiad.wmnet [10:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: After mariadb restart and upgraed', diff saved to https://phabricator.wikimedia.org/P16909 and previous config saved to /var/cache/conftool/dbconfig/20210727-101053-root.json [10:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:12] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) oh cat_pages exist in category and we don't need to rely on categorylinks. That's pretty good. 
[10:11:41] !log replacing scap proxies: mw1269 with mw1420, mw1285 with mw1306 [10:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:23] (03PS1) 10Jbond: debian::autostart: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/708263 [10:12:30] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Papaul maybe I'm missing some limitation, but there is any reason to just be using two switches per row on codfw for LVS networking links? Addi... [10:13:02] (03CR) 10Jbond: [C: 03+2] debian::autostart: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/708263 (owner: 10Jbond) [10:13:04] Amir1: however, cat_pages is pretty notorious for being incorrect. I don't know what the current status is, but it used to be updated in a different transaction and would get out of sync. But it probably provides reasonable order of magnitude estimates [10:13:29] yeah [10:13:47] bawolff: one thing is I want to know if there is other ways to abuse DPL [10:13:55] not just intersection [10:16:47] !log gitlab-ansible playbook on gitlab2001.wikimedia.org END (PASS) [10:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:08] (03CR) 10Dzahn: [C: 04-2] "depends on https://gerrit.wikimedia.org/r/c/operations/puppet/+/705852" [puppet] - 10https://gerrit.wikimedia.org/r/708257 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [10:21:19] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [10:22:22] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [10:25:39] (03PS1) 10Jgiannelos: maps: Disable tilerator on maps1009 [puppet] - 
10https://gerrit.wikimedia.org/r/708264 [10:25:48] (03PS2) 10Elukey: profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) [10:25:50] (03PS1) 10Elukey: aptrepo: add a component for iptables to wikimedia-buster [puppet] - 10https://gerrit.wikimedia.org/r/708265 (https://phabricator.wikimedia.org/T287238) [10:26:34] (03PS1) 10Filippo Giunchedi: Fix non-test yaml file globbing [alerts] - 10https://gerrit.wikimedia.org/r/708266 [10:26:56] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve-ctrl2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:27:49] (03PS2) 10Elukey: aptrepo: add a component for iptables to wikimedia-buster [puppet] - 10https://gerrit.wikimedia.org/r/708265 (https://phabricator.wikimedia.org/T287238) [10:27:51] (03PS3) 10Elukey: profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) [10:32:18] (03PS3) 10Elukey: aptrepo: add a component for iptables to wikimedia-buster [puppet] - 10https://gerrit.wikimedia.org/r/708265 (https://phabricator.wikimedia.org/T287238) [10:32:20] (03PS4) 10Elukey: profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) [10:34:47] (03CR) 10Jelto: "Metrics on gitlab1001 and gitlab2001 are working as expected and are reachable by Prometheus now." 
[puppet] - 10https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [10:36:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/708265 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [10:37:52] <_joe_> jouncebot: next [10:37:52] In 0 hour(s) and 22 minute(s): European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1100) [10:39:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1269.eqiad.wmnet [10:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [10:39:12] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1269.eqiad.wmnet` - m... [10:39:56] (03CR) 10Jelto: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30353/console" [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [10:42:45] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30354/console" [puppet] - 10https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [10:43:41] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10DonSimon) Dear @Ladsgroup . If I or Krassotkin insulted you with something e... [10:58:08] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ArielGlenn) [10:59:23] (03CR) 10Dzahn: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1100). [11:00:05] kart_ and _joe_: A patch you scheduled for European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:19] o/ [11:00:24] <_joe_> o/ [11:00:31] <_joe_> \o [11:00:42] <_joe_> _o [11:00:43] watching that everything is ok after 2 scap proxies were replaced with newer hosts [11:00:49] (eqiad) [11:01:05] <_joe_> Lucas_WMDE: I would mostly need a +1 on my patch, then I can take care of it myself [11:01:11] * Lucas_WMDE looks [11:01:45] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [11:01:52] * kart_ is here [11:03:02] Also, my patch isn't testable, AFAIK. So, you can just deploy. I need to confirm later once data collection is done by Neil and other people. 
[11:03:29] _joe_: that’s too mysterious for me to +1 at the moment, sorry [11:03:37] <_joe_> Lucas_WMDE: ack [11:03:49] it sounds exciting though ^^ [11:04:06] <_joe_> https://github.com/wikimedia/operations-mediawiki-config/blob/master/README specifies what debug.json is for, btw :) [11:04:08] (03CR) 10Jbond: "lgtm comment/question inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [11:04:34] ah, it’s not used to build a dblist? [11:04:44] (I tried SSHing to the host and it didn’t work ^^) [11:04:50] wait, dblist isn’t the right word [11:04:52] but a list of hosts for scap [11:05:21] <_joe_> no, it's just the list that appears in the x-wikimedia-debug extension [11:05:38] <_joe_> the routing at the traffic/application layer already works [11:05:39] alright, then I can +1 it after all [11:05:46] <_joe_> heh ok :) [11:05:46] but let’s do kart_’s change first [11:05:51] <_joe_> sure [11:05:57] (03PS11) 10Lucas Werkmeister (WMDE): Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [11:06:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [11:06:54] (03Merged) 10jenkins-bot: Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [11:07:32] (03CR) 10Jbond: [C: 03+2] policy-rc.d: update policy-rc.d script to handle missing services [puppet] - 10https://gerrit.wikimedia.org/r/708100 (owner: 10Jbond) [11:07:48] change pulled to mwdebug2001, testing [11:08:46] looks fine, let’s sync [11:09:20] Cool! 
[11:09:49] _joe_: if the extension gets that JSON file from noc.wikimedia.org, could it be tested with a `scap pull` on mwmaint2002? ^^ [11:09:58] (noc is served by one of the maint hosts isn’t it? though I don’t remember which one) [11:10:03] (03CR) 10Vgutierrez: [C: 03+1] Add the experimental kubernetes backend to mwdebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708255 (https://phabricator.wikimedia.org/T283056) (owner: 10Giuseppe Lavagetto) [11:10:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704456|Add stream configuration for ContentTranslation events (T281982)]] (duration: 00m 58s) [11:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:17] T281982: Configure the Event Platform backend to accept events in the content_translation_event stream - https://phabricator.wikimedia.org/T281982 [11:10:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add the experimental kubernetes backend to mwdebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708255 (https://phabricator.wikimedia.org/T283056) (owner: 10Giuseppe Lavagetto) [11:10:36] Thanks Lucas_WMDE [11:10:42] np [11:10:46] _joe_: feel free to deploy [11:12:10] Lucas_WMDE: we switched that to codfw as well in https://gerrit.wikimedia.org/r/c/operations/dns/+/704293 [11:12:23] ok [11:12:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the experimental kubernetes backend to mwdebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708255 (https://phabricator.wikimedia.org/T283056) (owner: 10Giuseppe Lavagetto) [11:13:08] (03Merged) 10jenkins-bot: Add the experimental kubernetes backend to mwdebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708255 (https://phabricator.wikimedia.org/T283056) (owner: 10Giuseppe Lavagetto) [11:13:25] Lucas_WMDE: so... no issues or warnings during sync about specific hosts, right? 
[11:13:33] nope [11:14:02] cool, just wanted to make sure, replaced scap proxies, all good [11:14:55] (03CR) 10Volans: Class API: add on_error() method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [11:17:38] _joe_: x-wikimedia-debug-routing: no match found for the backend specified in X-Wikimedia-Debug [11:17:48] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10MoritzMuehlenhoff) [11:17:51] <_joe_> RhinosF1: I am aware [11:18:31] <_joe_> oh found the issue, it's the ats-level lua regex [11:18:41] <_joe_> ok I can sync [11:18:46] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10MoritzMuehlenhoff) [11:18:46] <_joe_> and fix the ats stuff later [11:19:02] (03PS1) 10Jbond: policy-rc.d: capture all errors [puppet] - 10https://gerrit.wikimedia.org/r/708272 [11:20:32] !log oblivian@deploy1002 Synchronized debug.json: Config: [[gerrit:708255|Add the experimental kubernetes backend to mwdebug (T283056)]] (duration: 00m 56s) [11:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:39] T283056: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 [11:23:05] !log EU backport+config window done [11:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:31] (03PS1) 10Giuseppe Lavagetto: trafficserver: fix regex match for wikimedia debug [puppet] - 10https://gerrit.wikimedia.org/r/708274 [11:26:14] <_joe_> vgutierrez: ^^ (also RhinosF1 if you're curious of what was wrong) [11:27:10] nice catch [11:27:27] ack [11:30:11] (03CR) 10Jbond: [C: 03+2] policy-rc.d: capture all errors [puppet] - 10https://gerrit.wikimedia.org/r/708272 (owner: 10Jbond) [11:31:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] 
trafficserver: fix regex match for wikimedia debug [puppet] - 10https://gerrit.wikimedia.org/r/708274 (owner: 10Giuseppe Lavagetto) [11:31:25] <_joe_> RhinosF1: also don't expect much from that endpoint :P [11:31:38] <_joe_> I'll send an email to ops-l explaining what that is for today [11:32:14] ok [11:32:28] i'm mortal so not on ops-l [11:32:48] <_joe_> heh I'll x-post to wikitech-l too :) [11:33:01] :) [11:33:28] <_joe_> but basically, don't expect that to work at all, it might fail at any time :P [11:33:38] <_joe_> it probably won't have profiling enabled right now [11:33:47] <_joe_> and surely it's not updated with new releases [11:37:06] (03PS1) 10Jelto: disable backup cronjobs for gitlab2001 [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/708275 (https://phabricator.wikimedia.org/T285867) [11:40:49] (03CR) 10Jbond: Class API: add on_error() method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [11:48:50] (03PS3) 10Jelto: site/conftool: add mw1447,mw1448,mw1449,mw1450 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) [11:50:50] (03PS1) 10Giuseppe Lavagetto: mediawiki: use pod name as value of the Server header [deployment-charts] - 10https://gerrit.wikimedia.org/r/708279 [11:52:43] (03PS2) 10Filippo Giunchedi: Fix non-test yaml file globbing [alerts] - 10https://gerrit.wikimedia.org/r/708266 (https://phabricator.wikimedia.org/T287142) [11:52:45] (03PS1) 10Filippo Giunchedi: Import alerts for Thanos rule [alerts] - 10https://gerrit.wikimedia.org/r/708280 (https://phabricator.wikimedia.org/T287142) [12:13:01] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ops add job to scrape gitlab metrics [puppet] - 10https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:18:49] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: allow access to cloud-cumin-* hosts as cumin_masters [puppet] - 
10https://gerrit.wikimedia.org/r/708243 (https://phabricator.wikimedia.org/T287269) (owner: 10Filippo Giunchedi) [12:21:15] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [12:26:57] (03CR) 10Elukey: [C: 03+2] aptrepo: add a component for iptables to wikimedia-buster [puppet] - 10https://gerrit.wikimedia.org/r/708265 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [12:38:29] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) Beside the aforementioned issue above (which can be mitigated to some degree), DPL has other scalability issues that need examining before taking a... [12:39:59] (03PS1) 10Elukey: aptrepo: fix typo for component/iptables185 [puppet] - 10https://gerrit.wikimedia.org/r/708283 [12:42:43] (03CR) 10Elukey: [C: 03+2] aptrepo: fix typo for component/iptables185 [puppet] - 10https://gerrit.wikimedia.org/r/708283 (owner: 10Elukey) [12:43:51] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@346ac10]: (no justification provided) [12:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:55] (03CR) 10JMeybohm: [C: 03+1] "Just a nit. 
LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/708279 (owner: 10Giuseppe Lavagetto) [12:50:05] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@346ac10]: (no justification provided) (duration: 06m 13s) [12:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:39] !log created component/iptables185 for buster-wikimedia + imported packages from buster-backports [12:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:56] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) @Vgutierrez the only limitation I can see is the number of NIC ports on each server. Each server has 4 NICs, each NIC connected to 1 row on 1 swi... [13:03:34] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Papaul nope.. 
the idea would be replace some of the current links with new ones to additional switches [13:04:47] (03CR) 10Ottomata: Use admin module to manage system user for use by human users (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [13:08:53] (03PS5) 10Elukey: profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) [13:11:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 4 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30355/console" [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [13:13:33] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Elitre) Which of the four options described at https://www.mediawiki.org/wiki/Extension:DynamicPageList is this task about, please? [13:13:44] (03PS1) 10Jbond: admin::user: add support for nonexistent home directory [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) [13:14:17] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) https://www.mediawiki.org/wiki/Extension:DynamicPageList_(Wikimedia) [13:14:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30356/console" [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:16:20] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Elitre) Thanks. 
The page I linked to seems to imply that a couple options there are perhaps more recent and/or actively maintained? Since I don't know or unde... [13:16:33] (03PS1) 10Ottomata: eventgate-analytics - Use canary releases for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/708289 (https://phabricator.wikimedia.org/T272714) [13:16:35] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Elitre) [13:17:31] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) @Vgutierrez if my understanding is right you want for example lvsX NIC 1 to switch asw-a2 NIC 2 to switch asw-a7 (2 switches in ROW A) and NIC 3 to... [13:19:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Use canary releases for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/708289 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [13:19:52] (03CR) 10Jelto: [V: 03+1 C: 03+2] prometheus::ops add job to scrape gitlab metrics [puppet] - 10https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [13:20:30] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) If you mean that we replace it with another DPL extension, this has been looked at and those extensions seems to be even worse than the one current... 
[13:20:52] (03PS6) 10Elukey: profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) [13:21:24] (03CR) 10jerkins-bot: [V: 04-1] profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [13:21:28] (03PS2) 10Jelto: prometheus::ops add job to scrape gitlab metrics [puppet] - 10https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) [13:22:15] (03PS7) 10Btullis: Enable kerberos ticket auto-renewal for a test client [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [13:22:25] (03PS7) 10Elukey: profile::kubernetes::node: add component/iptables for Buster [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) [13:23:04] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30360/console" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:24:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 4 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30361/console" [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [13:25:58] (03PS1) 10Muehlenhoff: Extend access for kaywong [puppet] - 10https://gerrit.wikimedia.org/r/708290 [13:26:59] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for kaywong [puppet] - 10https://gerrit.wikimedia.org/r/708290 (owner: 10Muehlenhoff) [13:29:39] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[13:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:49] (03CR) 10Elukey: "I have some generic questions about the approach, just to understand where we are going :)" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:30:21] !log deploying eventgate-analytics with native prometheus support. Doing this slowly on canary release first to ensure https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-14_eventgate-analytics_latency_spike_caused_MW_app_server_overload is fixed. [13:30:26] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) the current problem is that both lvs2007 (primary for high-traffic1) and lvs2010 (secondary) get row A traffic from the very same switch, so if... [13:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:33] (03CR) 10Btullis: "I think that this is ready to merge, in order to start testing auto-renewal functionality on an-test-client1001. 
It should be a noop for a" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:30:49] (03PS10) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [13:32:14] (03PS1) 10Jelto: prometheus::ops fix gitlab rails metric path [puppet] - 10https://gerrit.wikimedia.org/r/708291 (https://phabricator.wikimedia.org/T275170) [13:32:21] (03CR) 10Giuseppe Lavagetto: mediawiki: use pod name as value of the Server header (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/708279 (owner: 10Giuseppe Lavagetto) [13:32:47] (03PS2) 10Giuseppe Lavagetto: mediawiki: use pod name as value of the Server header [deployment-charts] - 10https://gerrit.wikimedia.org/r/708279 [13:33:15] (03CR) 10Btullis: "I think that this is ready to merge in order to start testing this functionality with Presto in the test cluster." [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:34:25] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:45] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:10] (03CR) 10Jelto: "I forgot to also update the metric path of GitLab rails exporter to /metrics when using a standalone metrics exporter. 
Currently Prometheu" [puppet] - 10https://gerrit.wikimedia.org/r/708291 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [13:39:59] !log volans@deploy1002 Started deploy [netbox/deploy@660ad14]: Deploy v2.10.4-wmf5 [13:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] (03CR) 10Muehlenhoff: "I like this! One suggestion inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:40:22] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:40] ^^ is the deploy that caused the 10 minute API outage 2 weeks ago. Should be fixed and is working fine with traffic in the canary pod. Just FYI all! [13:42:13] looking healthy so far [13:42:28] !log volans@deploy1002 Finished deploy [netbox/deploy@660ad14]: Deploy v2.10.4-wmf5 (duration: 02m 29s) [13:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:47] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ops fix gitlab rails metric path [puppet] - 10https://gerrit.wikimedia.org/r/708291 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [13:44:36] (03CR) 10Jelto: [C: 03+2] prometheus::ops fix gitlab rails metric path [puppet] - 10https://gerrit.wikimedia.org/r/708291 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [13:44:45] (03CR) 10Ottomata: [C: 03+1] "Nice, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:45:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:46:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:46:32] ^ 👀 fix in progress [13:46:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:46:57] (03CR) 10Muehlenhoff: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:47:39] PROBLEM - tilerator on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:49:31] (03PS1) 10Ottomata: eventgate - use native prometheus in all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/708292 (https://phabricator.wikimedia.org/T272714) [13:49:33] PROBLEM - tilerator on maps1007 is CRITICAL: connect to address 10.64.16.6 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:50:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - use native prometheus in all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/708292 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [13:51:29] RECOVERY - tilerator on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:52:17] !log 
otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:52:19] (03CR) 10Btullis: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:53] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) Ok understood. Please provide me with the configuration you want in a table like above for each server which NIC connects to which switch and i can... [13:52:54] (03CR) 10Andrew Bogott: [C: 03+2] Install tmpreaper on quarry web hosts, clean up temp files 4+ days idle [puppet] - 10https://gerrit.wikimedia.org/r/708150 (https://phabricator.wikimedia.org/T238375) (owner: 10Andrew Bogott) [13:54:46] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:54:46] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:40] (03CR) 10Btullis: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:56:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:57:13] RECOVERY - tilerator on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:59:08] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:59:08] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:18] (03CR) 10Btullis: "Against which hosts should I run a puppet-compiler test to check the DNS discovery settings?" [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:00:57] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:28] (03PS1) 10Jbond: admin::user: pass home_dir to the user object [puppet] - 10https://gerrit.wikimedia.org/r/708294 [14:03:50] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:03:51] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:31] (03PS2) 10Jbond: admin::user: add support for nonexistent home directory [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) [14:07:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30363/console" [puppet] - 10https://gerrit.wikimedia.org/r/708294 (owner: 10Jbond) [14:07:18] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:07:18] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:49] (03CR) 10Elukey: [V: 03+1 C: 03+2] "Since this affects only buster nodes, and hence ml-ones, I am going to merge to test the changes etc." [puppet] - 10https://gerrit.wikimedia.org/r/708258 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [14:10:14] (03CR) 10Muehlenhoff: [C: 03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/708294 (owner: 10Jbond) [14:10:38] (03CR) 10Herron: [C: 03+1] Fix non-test yaml file globbing [alerts] - 10https://gerrit.wikimedia.org/r/708266 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:11:15] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:50] !log installing aspell security updates [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:27] (03CR) 10Herron: [C: 03+1] Import alerts for Thanos rule [alerts] - 10https://gerrit.wikimedia.org/r/708280 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:13:30] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:13:30] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:29] (03PS3) 10Jbond: admin::user: add support for nonexistent home directory [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) [14:15:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30364/console" [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [14:15:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] admin::user: pass home_dir to the user object [puppet] - 10https://gerrit.wikimedia.org/r/708294 (owner: 10Jbond) [14:15:50] (03CR) 10Ottomata: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:15:52] (03CR) 10Herron: [C: 03+1] pontoon: wait for puppetdb to be up before enabling it [puppet] - 10https://gerrit.wikimedia.org/r/708033 (owner: 10Filippo Giunchedi) [14:16:34] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[14:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:51] (03CR) 10Herron: [C: 03+1] rsyslog: add gitlab input-file entries to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/708160 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [14:17:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30365/console" [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [14:18:11] (03CR) 10Jbond: [V: 03+1] "updated thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [14:19:15] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] (03PS1) 10Elukey: profile::kubernetes::node: add netbase to iptables dependencies [puppet] - 10https://gerrit.wikimedia.org/r/708297 (https://phabricator.wikimedia.org/T287238) [14:21:35] (03CR) 10Elukey: [C: 03+2] profile::kubernetes::node: add netbase to iptables dependencies [puppet] - 10https://gerrit.wikimedia.org/r/708297 (https://phabricator.wikimedia.org/T287238) (owner: 10Elukey) [14:23:51] PROBLEM - tilerator on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:23:53] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:49] (03CR) 10Jbond: [C: 03+1] "Puppet wise LGTM, although can't comment on the hadoop side of things" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [14:25:20] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 T287230', diff saved to https://phabricator.wikimedia.org/P16916 and previous config saved to /var/cache/conftool/dbconfig/20210727-142520-marostegui.json [14:25:22] (03PS4) 10Jbond: admin::user: add support for nonexistent home directory [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) [14:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:28] T287230: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 [14:26:06] (03CR) 10Ottomata: Use admin module to manage system user for use by human users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [14:26:46] (03PS1) 10Marostegui: db1129: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/708299 (https://phabricator.wikimedia.org/T287230) [14:26:55] (03CR) 10Jbond: [C: 03+1] Use admin module to manage system user for use by human users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [14:27:30] (03CR) 10Marostegui: [C: 03+2] db1129: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/708299 (https://phabricator.wikimedia.org/T287230) (owner: 10Marostegui) [14:28:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2001.codfw.wmnet [14:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:19] !log reduce vcores for ml-serve-ctrl[12]00[12] after performance testing - T287238 [14:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:26] T287238: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 [14:29:46] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30366/console" [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:30:20] !log Add peering to AS398196 - Cobalt Ridge at DE-CIX Dallas on cr2-codfw. [14:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet [14:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:56] !log depool cp10[79-82].eqiad.wmnet - T286061 [14:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:02] T286061: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 [14:34:07] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2002.codfw.wmnet [14:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:36] (03CR) 10Jbond: "> Patch Set 10:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:34:41] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:25] 10SRE, 10Data-Persistence-Backup, 10bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (10TJones) Thanks, @jcrespo! It looks like everything I need is there. 
[14:36:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Vgutierrez) [14:40:10] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) Performance is currently severely degraded; it seems each request made to archiva has a 3-4 second delay before starting the tr... [14:40:19] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-serve-ctrl2002.codfw.wmnet [14:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:28] manually terminated -^ [14:40:30] !log depool authdns1001 - T286061 [14:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:37] T286061: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 [14:40:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1129.eqiad.wmnet with reason: REIMAGE [14:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [14:41:27] elukey: https://doc.wikimedia.org/spicerack/master/api/spicerack.cookbook.html#spicerack.cookbook.INTERRUPTED_RETCODE :-P [14:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] volans: ? 
[14:42:11] the interrupted exit code (exit_code=97) :D [14:42:23] manually interrupted [14:42:38] sure but not all people check the spicerack docs every time :D [14:42:51] I know, was just joking with you :-) [14:42:57] sure sure [14:43:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1129.eqiad.wmnet with reason: REIMAGE [14:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:00] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [14:45:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [14:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:24] (03CR) 10Cwhite: [C: 03+1] global: add a simple requires.txt [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/707256 (owner: 10David Caro) [14:46:57] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cp[1079-1082].eqiad.wmnet with reason: Eqiad row B maintenance [14:46:59] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp[1079-1082].eqiad.wmnet with reason: Eqiad row B maintenance [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:14] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 4 host(s) and their services with reason: Eqiad row B maintenance ` c... 
[14:47:37] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:57] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [14:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:09] PROBLEM - Bird Internet Routing Daemon on authdns1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:48:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:45] ^^ that's expected from authdns1001 depool [14:49:44] (03CR) 10Jbond: "LGTM couple of nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708033 (owner: 10Filippo Giunchedi) [14:51:21] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on authdns1001.wikimedia.org with reason: Eqiad row B maintenance [14:51:22] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on authdns1001.wikimedia.org with reason: Eqiad row B maintenance [14:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:36] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row B maintenance ` a... 
[14:52:18] 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T287481 (10ops-monitoring-bot) [14:52:33] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Vgutierrez) [14:52:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [14:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:50] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet [14:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:57] !log depool lvs1014 - T286061 [14:52:59] !log disabling puppet for upcoming row B maintenance [14:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:03] T286061: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 [14:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1162 T287230', diff saved to https://phabricator.wikimedia.org/P16917 and previous config saved to /var/cache/conftool/dbconfig/20210727-145352-marostegui.json [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] T287230: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 [14:54:10] (03PS10) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [14:54:28] (03PS1) 10Marostegui: Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/708213 [14:54:33] (03CR) 10Btullis: [V: 03+1] Update TLS configuration for analytics-test-presto (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706661 
(https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:55:15] (03CR) 10Marostegui: [C: 03+2] Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/708213 (owner: 10Marostegui) [14:55:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs1014.eqiad.wmnet with reason: Eqiad row B maintenance [14:55:16] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs1014.eqiad.wmnet with reason: Eqiad row B maintenance [14:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:32] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row B maintenance ` l... 
[14:56:00] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Vgutierrez) [14:57:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet [14:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:35] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Marostegui) [14:58:15] 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T287481 (10RhinosF1) 05Open→03Invalid @Marostegui was doing a reimage [14:58:18] 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T287481 (10Marostegui) 05Invalid→03Declined This was due to a reimage, it is being tracked at: T285715 [15:01:25] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Joe) I think there are two options, depending on the level of security we want to achieve and the urgen... [15:04:03] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:04:28] ^ authdns depool or something else? 
[15:04:36] yep [15:04:41] yup, authdns1001 [15:05:03] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica1003.wikimedia.org [15:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: use pod name as value of the Server header [deployment-charts] - 10https://gerrit.wikimedia.org/r/708279 (owner: 10Giuseppe Lavagetto) [15:08:18] (03PS1) 10Marostegui: Revert "wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012" [dns] - 10https://gerrit.wikimedia.org/r/708214 [15:08:29] (03PS2) 10Marostegui: Revert "wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012" [dns] - 10https://gerrit.wikimedia.org/r/708214 [15:08:59] 10SRE, 10Data-Persistence-Backup, 10bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (10jcrespo) 05Open→03Resolved [15:09:11] !log pool cp10[79-82].eqiad.wmnet - T286061 [15:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:17] T286061: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 [15:09:36] (03Merged) 10jenkins-bot: mediawiki: use pod name as value of the Server header [deployment-charts] - 10https://gerrit.wikimedia.org/r/708279 (owner: 10Giuseppe Lavagetto) [15:10:11] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [15:10:37] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012" [dns] - 10https://gerrit.wikimedia.org/r/708214 (owner: 10Marostegui) [15:10:47] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet [15:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:24] !log pool authdns1001.wikimedia.org - T286061 [15:11:29] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:56] !log Move m1-master from dbproxy1012 to dbproxy1014 T286061 [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:13] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:12:36] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Marostegui) >>! In T286061#7231980, @Marostegui wrote: > m1-master.eqiad.wmnet switched over to dbproxy1012 which is on row A. Once this row is... [15:12:37] (03CR) 10Bstorm: [C: 03+2] lighttpd surprises: we don't get python2 in php images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/708187 (https://phabricator.wikimedia.org/T287421) (owner: 10Bstorm) [15:12:41] RECOVERY - Bird Internet Routing Daemon on authdns1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:13:37] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:15:12] (03Merged) 10jenkins-bot: lighttpd surprises: we don't get python2 in php images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/708187 (https://phabricator.wikimedia.org/T287421) (owner: 10Bstorm) [15:16:08] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [15:16:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet [15:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet [15:16:47] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:07] !log pool lvs1014.eqiad.wmnet - T286061 [15:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:13] T286061: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 [15:18:20] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [15:18:27] (03CR) 10Cwhite: [C: 03+2] logstash: complete restbase transition to ECS [puppet] - 10https://gerrit.wikimedia.org/r/705729 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:18:29] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 77, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:18:39] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Bstorm) [15:19:16] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Bstorm) [15:19:24] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [15:20:56] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [15:21:33] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repoDatabase'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708308 (https://phabricator.wikimedia.org/T257260) [15:21:35] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientRepoDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708309 (https://phabricator.wikimedia.org/T257260) [15:21:52] PROBLEM - ElasticSearch health check 
for shards on 9200 on logstash2021 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbf23fe12e8: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [15:21:52] ia.org/wiki/Search%23Administration [15:22:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet [15:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:15] !log cirrus: reindexing 823 wikis in elastic@[eqiad, codfw and cloudelastic] to apply new mapping (weighted_tags) T147505 [15:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:21] T147505: [tracking] CirrusSearch: what is updated during re-indexing - https://phabricator.wikimedia.org/T147505 [15:22:56] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2021 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 14, number_of_data_nodes: 9, active_primary_shards: 513, active_shards: 1196, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fe [15:22:56] task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:23:24] (03CR) 10Jbond: [C: 03+1] "LGTM but im not super confident about the service definition (have made my own mistakes before :))" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [15:25:37] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: remove unused container.javaOptions values [puppet] - 
10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [15:25:54] (03PS1) 10Herron: logstash: remove references to logstash202[012] [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) [15:26:54] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: remove unused settings from [container] [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [15:28:16] PROBLEM - Host ml-serve-ctrl1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:31] this is me, late in downtiming --^ [15:34:25] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix non-test yaml file globbing [alerts] - 10https://gerrit.wikimedia.org/r/708266 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [15:34:31] (03CR) 10Filippo Giunchedi: [C: 03+2] Import alerts for Thanos rule [alerts] - 10https://gerrit.wikimedia.org/r/708280 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [15:37:49] (03CR) 10Filippo Giunchedi: "LGTM overall, removing entries from modules/install_server/files/autoinstall/netboot.cfg is missing though" [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [15:39:24] (03PS2) 10Herron: logstash: remove references to logstash202[012] [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) [15:39:48] RECOVERY - Host ml-serve-ctrl1001 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [15:39:51] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [15:41:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [15:42:25] !log add disk_template drbd back to ml-serve-ctrl100[12] vms after performance testing - T287238 [15:42:31] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:32] T287238: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 [15:43:52] (03PS8) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [15:45:22] (03CR) 10jerkins-bot: [V: 04-1] profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [15:49:40] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Ladsgroup) >>! In T285104#7239822, @Joe wrote: > I think there are two options, depending on the level... [15:49:42] (03PS9) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [15:49:53] (03PS1) 10Ottomata: Move airflow-analytics-test instance to an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/708314 (https://phabricator.wikimedia.org/T285692) [15:50:18] (03CR) 10jerkins-bot: [V: 04-1] profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [15:51:33] (03PS10) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [15:52:12] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30370/console" [puppet] - 10https://gerrit.wikimedia.org/r/708314 (https://phabricator.wikimedia.org/T285692) (owner: 10Ottomata) [15:52:17] (03CR) 10jerkins-bot: [V: 04-1] profile: create in module data for profile [puppet] - 
10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [15:52:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30371/console" [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [15:53:15] (03PS11) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [15:53:58] (03PS12) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [15:54:06] (03CR) 10jerkins-bot: [V: 04-1] profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [15:54:09] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Move airflow-analytics-test instance to an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/708314 (https://phabricator.wikimedia.org/T285692) (owner: 10Ottomata) [15:54:36] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) a:03Joe Status update: the mwdebug installation is now reachable from external users via the Wikimedia Debug browser extensio... [15:54:49] (03CR) 10jerkins-bot: [V: 04-1] profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [16:00:04] jbond42 and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1600). 
[16:01:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 121 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:01:49] (03PS1) 10JMeybohm: rdf-streaming-updater: Allow egress to kubernetes api servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/708315 (https://phabricator.wikimedia.org/T287443) [16:04:58] (03CR) 10Brennen Bearnes: [C: 03+1] "LGTM - feel free to merge and deploy at will." [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/708275 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [16:06:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 37 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:09:05] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Allow egress to kubernetes api servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/708315 (https://phabricator.wikimedia.org/T287443) (owner: 10JMeybohm) [16:10:52] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) [16:12:01] (03Merged) 10jenkins-bot: rdf-streaming-updater: Allow egress to kubernetes api servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/708315 (https://phabricator.wikimedia.org/T287443) (owner: 10JMeybohm) [16:12:47] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) [16:14:03] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:31] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) [16:15:12] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10Joe) 05Open→03Resolved a:03Joe [16:15:18] 10SRE, 10Traffic, 10WikimediaDebug, 10Performance-Team (Radar): Allow ATS to route traffic to mwdebug deployment on kubernetes - https://phabricator.wikimedia.org/T286482 (10Joe) [16:15:58] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10Joe) [16:21:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [16:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:50] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 154 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:25:05] (03CR) 10Btullis: "This patch has previously had CR+1 but I edited it to address comments." [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:25:54] (03CR) 10Btullis: "Adding Razzi as an additional reviewer." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:26:14] (03PS11) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [16:26:30] (03PS8) 10Btullis: Enable kerberos ticket auto-renewal for a test client [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [16:26:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet [16:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:57] 10SRE, 10MW-on-K8s, 10serviceops, 10User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10Joe) [16:27:22] 10SRE, 10MW-on-K8s, 10serviceops, 10User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10Joe) p:05Triage→03Medium a:05Joe→03dancy [16:27:44] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 35 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:29:42] RECOVERY - tilerator on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:31:42] (03PS3) 10Herron: logstash: remove references to logstash202[012] [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) [16:34:59] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts logstash2020.codfw.wmnet [16:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:33] (03CR) 10Herron: [C: 03+2] logstash: remove 
references to logstash202[012] [puppet] - 10https://gerrit.wikimedia.org/r/708311 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [16:37:57] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet [16:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:44] 10SRE, 10Analytics, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) Note that uploading is fast. Here for a file named `service-0.3.78-dist.tar.gz` ` 01:42:07.283 [INFO] [INFO] Uploaded to archiva.releases: https://archi... [16:42:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet [16:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1004.eqiad.wmnet [16:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:50] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash2020.codfw.wmnet [16:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:12] (03CR) 10Jbond: "cloud PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30373" [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [16:47:19] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts logstash2021.codfw.wmnet [16:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1004.eqiad.wmnet [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single 
for host ml-serve1003.eqiad.wmnet [16:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1003.eqiad.wmnet [16:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1700). [17:01:47] (03Abandoned) 10Krinkle: Add k8s-experimental to the list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706701 (https://phabricator.wikimedia.org/T286491) (owner: 10Effie Mouzeli) [17:09:29] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) 05Open→03Resolved [17:09:37] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [17:14:48] (03CR) 10Jbond: [C: 03+1] Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [17:15:52] (03CR) 10Btullis: [C: 03+2] Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [17:17:17] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash2021.codfw.wmnet [17:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:17] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts logstash2022.codfw.wmnet [17:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:19:18] (03Merged) 10jenkins-bot: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [17:28:01] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash2022.codfw.wmnet [17:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:32] (03PS1) 10Legoktm: mwdebug: Add envoy proxy config for Shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/708324 [17:30:28] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) [17:32:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You also need to explicitly declare shellbox as one of the enabled listeners in values.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708324 (owner: 10Legoktm) [17:34:06] (03PS2) 10Legoktm: mwdebug: Add envoy proxy config for Shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/708324 [17:39:24] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) 05Open→03Resolved a:03herron All elk5 hardware has been decommed at this point. We have 3 Ganeti VMs per-site remaining which are needed to han... 
[17:40:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: Add envoy proxy config for Shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/708324 (owner: 10Legoktm) [17:41:06] (03CR) 10Legoktm: [C: 03+2] mwdebug: Add envoy proxy config for Shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/708324 (owner: 10Legoktm) [17:43:35] (03Merged) 10jenkins-bot: mwdebug: Add envoy proxy config for Shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/708324 (owner: 10Legoktm) [17:47:45] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:56] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1800) [18:10:01] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (10Legoktm) p:05Triage→03High [18:10:05] 10SRE, 10Datacenter-Switchover: switchdc cookbook should perform exponential backoff when checking DNS TTL - https://phabricator.wikimedia.org/T285800 (10Legoktm) p:05Triage→03Low [18:10:26] 10SRE, 10MW-on-K8s, 10serviceops: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (10Krinkle) [18:11:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:12:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:14:48] 10SRE, 10MW-on-K8s, 10serviceops: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (10Legoktm) p:05Triage→03Medium [18:17:19] !log MediaWiki branch `1.37.0-wmf.16` prepped and patched in preparation for the upcoming deployment window. [18:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:26] (03PS3) 10Jdlrobson: Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) [18:24:55] (03CR) 10jerkins-bot: [V: 04-1] Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [18:26:11] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:55] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:39] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [18:34:14] (03PS1) 10Ottomata: Migrate EchoMail and EchoInteraction to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) [18:34:59] (03CR) 10jerkins-bot: [V: 04-1] Migrate EchoMail and EchoInteraction to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) (owner: 10Ottomata) [18:36:02] (03PS2) 10Ottomata: Migrate EchoMail and EchoInteraction to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) [18:40:16] (03CR) 10Ottomata: "EchoMail looks like it is a PHP side event, 
so this will be a no-op for that. We'll have to do a Echo extension code deploy to fully migr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) (owner: 10Ottomata) [18:41:23] (03PS3) 10Ottomata: Migrate EchoMail and EchoInteraction to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) [18:44:05] (03PS4) 10Ottomata: Migrate EchoMail and EchoInteraction to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) [18:46:28] (03CR) 10Ottomata: [C: 03+2] Migrate EchoMail and EchoInteraction to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708331 (https://phabricator.wikimedia.org/T287210) (owner: 10Ottomata) [18:49:54] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate EchoMail and EchoInteraction to EventGate - T287210 (duration: 02m 28s) [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:02] T287210: EchoMail and EchoInteraction Event Platform Migration - https://phabricator.wikimedia.org/T287210 [18:53:15] (03CR) 10Jdlrobson: tests: Assert that wgUse* flags are boolean (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578540 (owner: 10Jforrester) [18:56:31] (03CR) 10Jdlrobson: tests: Assert that wgUse* flags are boolean (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578540 (owner: 10Jforrester) [18:58:44] (03PS4) 10Jdlrobson: Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) [19:00:05] twentyafterfour and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T1900). 
[19:02:38] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:19] (03PS4) 10Ottomata: Revert "Revert "kafka - Use hardened_tls instead of java::security" [puppet] - 10https://gerrit.wikimedia.org/r/707025 [19:11:35] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30374/console" [puppet] - 10https://gerrit.wikimedia.org/r/707025 (owner: 10Ottomata) [19:13:29] (03CR) 10Ottomata: [V: 03+1] "Noop, just puppet parameter changes" [puppet] - 10https://gerrit.wikimedia.org/r/707025 (owner: 10Ottomata) [19:13:31] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Revert "Revert "kafka - Use hardened_tls instead of java::security" [puppet] - 10https://gerrit.wikimedia.org/r/707025 (owner: 10Ottomata) [19:20:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:51] PROBLEM - Disk space on urldownloader2002 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=85%): /tmp 340 MB (3% inode=85%): /var/tmp 340 MB (3% inode=85%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader2002&var-datasource=codfw+prometheus/ops [19:25:09] (03PS1) 1020after4: testwikis wikis to 1.37.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708337 [19:25:11] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.37.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708337 (owner: 1020after4) [19:26:07] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708337 (owner: 1020after4) [19:26:12] !log twentyafterfour@deploy1002 Started scap: testwikis wikis to 
1.37.0-wmf.16 [19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:17] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:19] !log twentyafterfour@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.16 (duration: 36m 06s) [20:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:26] (03PS1) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [20:08:35] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:53] (03CR) 10jerkins-bot: [V: 04-1] gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [20:09:11] PROBLEM - tilerator on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [20:09:57] (03PS2) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [20:10:25] (03CR) 10jerkins-bot: [V: 04-1] gitlab: Provide profile for docker based GitLab runners [puppet] - 
10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [20:16:38] (03PS3) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [20:21:20] (03PS4) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [20:23:25] (03PS1) 1020after4: group0 wikis to 1.37.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708344 [20:23:26] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.37.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708344 (owner: 1020after4) [20:24:14] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708344 (owner: 1020after4) [20:25:35] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.16 [20:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:35] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:38] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:30:02] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott Im rebuilding this canary VM https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:30:57] hm, wtp1025 has WARNING: opcache cache-hit ratio is below 99.99% [20:31:02] it's 99.989 right now [20:32:15] hmm could be deployment related? 
[20:33:13] icinga says all the other wtp servers are fine, and this is in eqiad so it's pretty minor [20:33:26] also this alert has been on for ~10h now [20:34:11] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:42] RECOVERY - tilerator on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [20:35:57] !log twentyafterfour@deploy1002 Pruned MediaWiki: 1.37.0-wmf.14 (duration: 03m 12s) [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:38] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:43:59] !log ryankemper@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad [20:43:59] !log legoktm@wtp1025:~$ sudo systemctl restart php7.2-fpm # restart php-fpm, opcache hit ratio was warning [20:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:08] !log [WDQS] Returning `wdqs` dns discovery to the expected status of `(eqiad, codfw) = (depooled, pooled)`: `sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=false` [20:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:40] ^ This will resolve https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=DNS+Discovery+operations+diffs, forcing a re-check now [20:50:41] PROBLEM - puppet last run on ml-serve2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:58:21] 
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=DNS+Discovery+operations+diffs didn't resolve because there's another diff error, but the wdqs part is no longer the source of the error /shrug [20:58:46] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) >>! In T285104#7240005, @Michael wrote: >>>! In T285104#7239822, @Joe wrote: >> * How stringen... [20:59:43] ryankemper: I'll take care of the mwdebug diff, thanks :D [21:00:03] great :) [21:01:34] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) >>! In T285104#7239822, @Joe wrote: > I think there are two options, depending on the level of... [21:01:53] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:19] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet - https://phabricator.wikimedia.org/T286748 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:09:35] (03CR) 10Cwhite: [C: 03+2] rsyslog: add gitlab input-file entries to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/708160 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [21:11:51] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet - https://phabricator.wikimedia.org/T286748 (10Jclark-ctr) Replaced failed battery from purchase T245697 [21:13:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:15:12] SRE, ops-eqiad, DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Jclark-ctr) received replacement drive
[21:17:34] SRE, ops-eqiad, DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Jclark-ctr) @Marostegui can this drive be replaced?
[21:18:31] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:22:20] (PS1) Andrew Bogott: Trove: allow more db instances per tenant and bigger db volumes [puppet] - https://gerrit.wikimedia.org/r/708353
[21:23:26] (CR) Andrew Bogott: [C: +2] Trove: allow more db instances per tenant and bigger db volumes [puppet] - https://gerrit.wikimedia.org/r/708353 (owner: Andrew Bogott)
[21:25:04] SRE, ops-eqiad, DC-Ops: hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet - https://phabricator.wikimedia.org/T286748 (Jclark-ctr) Open→Resolved
[21:26:29] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:28] (PS1) Jdlrobson: Restore print, links, table and message box styles [skins/Vector] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708220 (https://phabricator.wikimedia.org/T278896)
[21:35:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:37:31] SRE, ops-eqiad, DC-Ops, Elasticsearch, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (Jclark-ctr) @Gehel would like to confirm racking. evenly spread across rows? it is replacing elastic[1032-1047...
[21:44:43] (PS1) Ahmon Dancy: Set $wmgUseScore to false in train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/708360
[21:45:11] (CR) Ahmon Dancy: [C: +2] Set $wmgUseScore to false in train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/708360 (owner: Ahmon Dancy)
[21:46:08] (Merged) jenkins-bot: Set $wmgUseScore to false in train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/708360 (owner: Ahmon Dancy)
[21:53:06] SRE, MW-on-K8s, serviceops: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (Legoktm) Currently scap generates the GitInfo "cache" files, see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/tasks.py#151 and...
[21:54:11] dancy: is that something I should've been aware about when re-enabling Score? do we need to setup Shellbox in that environment?
[21:54:42] Maybe some day to increase coverage in train-dev, but not today.
[21:54:57] train-dev is very stripped down
[21:55:51] SRE, MW-on-K8s, serviceops: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (dancy) a:dancy I can take this.
[22:00:43] (PS1) Legoktm: Move ruwikinews to large wikis dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/708364
[22:02:03] (CR) jerkins-bot: [V: -1] Move ruwikinews to large wikis dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/708364 (owner: Legoktm)
[22:02:21] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:06:35] (PS2) Legoktm: Move ruwikinews to large wikis dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/708364
[22:10:31] (CR) Legoktm: "Practically, this will remove the wiki from global abuse filters." [mediawiki-config] - https://gerrit.wikimedia.org/r/708364 (owner: Legoktm)
[22:11:30] (CR) RhinosF1: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/708364 (owner: Legoktm)
[22:15:45] SRE, MW-on-K8s, serviceops: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (jeena)
[22:16:28] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (jeena) Open→Resolved a:jeena We had updated the jenkins confi...
[22:18:02] (PS1) Jdlrobson: alertmanager: route readers web team alerts [puppet] - https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359)
[22:19:39] (CR) Legoktm: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/708364 (owner: Legoktm)
[22:20:24] legoktm: yeah somewhere meta might be best
[22:20:34] (CR) Jdlrobson: "I think this does the onboarding step, but I'm a bit lost on what I do after this to make sure https://grafana-rw.wikimedia.org/d/00000056" [puppet] - https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359) (owner: Jdlrobson)
[22:20:37] * legoktm nods
[22:20:44] The conduct of them hasn't been anywhere near acceptable
[22:21:08] We need somewhere they can't say they didn't see but won't be a <> on
[22:23:49] Wikimedia Forum page gets a lot of news on
[22:23:52] That might work
[22:25:49] (CR) RhinosF1: [C: +1] "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/708364 (owner: Legoktm)
[22:26:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:50] (PS3) Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/708156 (https://phabricator.wikimedia.org/T287215)
[22:44:06] (PS1) Legoktm: Stop enabling DPL on new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380)
[22:49:58] (CR) Nray: [C: +1] Enable user links on office + test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: Clare Ming)
[22:52:19] (PS11) Clare Ming: Enable user links on office + test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391)
[23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210727T2300).
[23:00:05] cjming and jdlrobson: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:16] SRE, DynamicPageList (Wikimedia), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Legoktm) Notwithstanding potential improvements to DPL, I think we can take the step of disabling it on projects where it's not being us...
[23:01:26] hi - i'm here and ready
[23:02:00] full disclosure: this is my first time deploying if you all are comfortable with that
[23:02:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:03:38] exciting! I'm lurking around if you need any assistance
[23:03:53] should I go ahead and start? i'm assuming it's straight up exactly the cmds on https://deploy-commands.toolforge.org/bacc/708152
[23:04:41] here
[23:04:47] looks clear to me :)
[23:04:57] glhf! :D
[23:05:05] (PS3) Thcipriani: logspam-watch: correctly handle 0 for total error counts [puppet] - https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) (owner: Brennen Bearnes)
[23:05:06] +1 :)
[23:05:23] ok - is it ok to +2 my own patch? or should one of the seasoned deployers do it?
[23:06:54] for operations/mediawiki-config and wmf/* branches of other repos (where it's been merged into mainline) when you're deploying: fine to +2 your own, I think
[23:07:20] yep it's okay to plus 2 in this situation
[23:07:27] (lots of hedging in that answer, but tl;dr if you're doing the backport: you're good :))
[23:07:27] cool - thanks
[23:07:41] (CR) Clare Ming: [C: +2] Enable user links on office + test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: Clare Ming)
[23:08:01] PROBLEM - snapshot of s2 in eqiad on alert1001 is CRITICAL: Last snapshot for s2 at eqiad (db1102.eqiad.wmnet:3312) taken on 2021-07-27 20:52:07 is 1048 GB, but previous one was 882 GB, a change of 18.8% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:09:12] (Merged) jenkins-bot: Enable user links on office + test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: Clare Ming)
[23:09:41] so now i can start cmds bec ^^ ?
[23:10:13] yep, you should be good to start fetching on the deployment host
[23:10:29] cjming: I can test this for you if that takes off a little of the stress :)
[23:11:33] ok - we can test on mwdebug1002
[23:12:14] cjming: it looks like it's working correctly to me.
[23:12:21] I see the new user links on test and office wiki but not the other projects
[23:12:29] i see user links on test + office wikis too
[23:13:27] alrighty - onward
[23:15:40] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:708152|Enable user links on office + test wikis (T287391)]] (duration: 02m 00s)
[23:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:49] T287391: Enable user links on office wiki and test wiki - https://phabricator.wikimedia.org/T287391
[23:16:13] w00p!!!
[23:16:33] seeing the change without x-debug where I expect to and where I don't expect it! :)
[23:16:45] phew!
[23:17:51] \o/
[23:18:20] first deploy == hardest deploy :)
[23:18:38] lol - my blood pressure is falling
[23:18:44] (CR) Legoktm: logspam-watch: correctly handle 0 for total error counts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) (owner: Brennen Bearnes)
[23:19:05] haha: I've had the thought that deployer blood pressure is a good metric to aim to improve in an okr
[23:19:10] cjming: so... are you feeling comfortable doing 2 more? :)
[23:19:39] uh - i can -- sure
[23:19:49] It's the two listed in https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_July_27
[23:20:01] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708156 is more or less a repeat of what you just did
[23:20:36] the other one is a backport (not config)
[23:20:50] ok - does it matter that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708156 says merge conflict?
[23:21:04] it should be rebased presumably?
[23:21:08] ^ yep
[23:22:17] the way we have mw-config setup is that history must be linear, so it requres a lot of manual rebases, but it makes reverts easy
[23:22:46] Jdlrobson> can you rebase 708156 ?
[23:24:20] SRE, DynamicPageList (Wikimedia), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Legoktm) >>! In T287380#7238964, @Bawolff wrote: > I personally think it would be reasonable to do a hard cut-off at wikis with > 100,00...
[23:24:27] hrm, is that one up-to-date already?
[23:24:42] > Change is up to date with the target branch already (master)
[23:24:59] (PS4) Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/708156 (https://phabricator.wikimedia.org/T287215)
[23:25:06] Done: ^
[23:25:29] (CR) Thcipriani: [C: +2] Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/708156 (https://phabricator.wikimedia.org/T287215) (owner: Jdlrobson)
[23:25:29] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:25:31] i'm going to hand over the reigns to and go lie down
[23:25:36] :D
[23:25:46] cjming: no worries :) Well done for getting that first one under your belt !
[23:25:58] congrats on the deploy cjming
[23:26:13] ty ty all for your patience and encouragement
[23:26:18] I swear we'll make it easier Soon™
[23:26:31] (Merged) jenkins-bot: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/708156 (https://phabricator.wikimedia.org/T287215) (owner: Jdlrobson)
[23:28:14] 1002 again?
[23:28:16] Jdlrobson: live on mwdebug2002, check please
[23:28:34] I like to keep you guessing
[23:29:11] (CR) Brennen Bearnes: logspam-watch: correctly handle 0 for total error counts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) (owner: Brennen Bearnes)
[23:29:16] thcipriani: mm.. something's not behaving how expected (looking)
[23:30:09] (CR) Thcipriani: [C: +2] "Backport" [skins/Vector] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708220 (https://phabricator.wikimedia.org/T278896) (owner: Jdlrobson)
[23:30:18] cjming: 'grats on that first deploy. :)
[23:30:54] thcipriani: we need to roll back the config change, it's not ready
[23:30:58] hit a snag. I'll open a bug
[23:31:06] Jdlrobson: ack, doing
[23:31:28] thcipriani: one sec..
[23:31:37] just checking something with my PM before I completely commit to that decision
[23:32:14] (PS1) Thcipriani: Revert "Enable WVUI search on Wikimedia Commons" [mediawiki-config] - https://gerrit.wikimedia.org/r/708381
[23:33:15] (already reverted on mwdebug2002, FYI)
[23:33:39] ack
[23:34:46] (CR) Legoktm: logspam-watch: correctly handle 0 for total error counts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) (owner: Brennen Bearnes)
[23:35:00] (CR) Legoktm: [C: +2] logspam-watch: correctly handle 0 for total error counts [puppet] - https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) (owner: Brennen Bearnes)
[23:36:00] (CR) Thcipriani: [C: +1] gerrit: remove unused settings from [container] [puppet] - https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[23:36:28] (CR) Thcipriani: [C: +1] gerrit: remove unused container.javaOptions values [puppet] - https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[23:36:31] thcipriani: okay seems like that was the right decision
[23:36:37] so just Vector to follow up on
[23:36:50] (CR) Thcipriani: [C: +2] Revert "Enable WVUI search on Wikimedia Commons" [mediawiki-config] - https://gerrit.wikimedia.org/r/708381 (owner: Thcipriani)
[23:36:55] Jdlrobson: ack
[23:37:25] * thcipriani waits on zuul
[23:37:33] or jenkins I guess
[23:37:42] (Merged) jenkins-bot: Revert "Enable WVUI search on Wikimedia Commons" [mediawiki-config] - https://gerrit.wikimedia.org/r/708381 (owner: Thcipriani)
[23:38:51] hah
[23:45:18] SRE, Datacenter-Switchover: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (Legoktm) p:Triage→Medium
[23:48:52] (Merged) jenkins-bot: Restore print, links, table and message box styles [skins/Vector] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708220 (https://phabricator.wikimedia.org/T278896) (owner: Jdlrobson)
[23:49:10] woot
[23:49:14] @thcipriani we're back on!^
[23:49:43] * thcipriani goes
[23:51:07] Jdlrobson: live on mwdebug2002, check please
[23:52:19] Tables have styling again on https://www.mediawiki.org/wiki/Special:Version?useskinversion=2 so that's good enough for me!
[23:52:26] sync away!
[23:52:29] cool, going live
[23:53:51] !log thcipriani@deploy1002 Synchronized php-1.37.0-wmf.16/skins/Vector: Backport: [[gerrit:708220|Restore print, links, table and message box styles (T278896)]] (duration: 01m 07s)
[23:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:59] T278896: Vector will not use the `legacy` ResourceLoaderSkinModule - https://phabricator.wikimedia.org/T278896
[23:54:14] ^ Jdlrobson live
[23:57:08] thanks thcipriani
[23:57:11] phew
[23:57:17] that's one less phab ticket i'm getting this week :)
[23:57:21] heh
[23:57:28] always a good feeling