[00:00:32] tgr: looking
[00:01:04] thcipriani: thanks, I'm doing it. Just wanted to give a heads up.
[00:01:10] ah, k
[00:01:12] should have been clearer about that.
[00:01:16] thanks for the heads up :)
[00:02:04] * thcipriani can never remember who has deploy creds and who doesn't
[00:08:28] (PS1) Thcipriani: feat: Add mwdebug cname [dns] - https://gerrit.wikimedia.org/r/708874
[00:10:19] (PS2) Thcipriani: feat: Add mwdebug cname [dns] - https://gerrit.wikimedia.org/r/708874
[00:14:24] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) had [[ https://grafana.wikimedia.org/d/C0lCOf3Mz/puppetdb-postgres?orgId=1&from=1627588800000&to=1627613999000 | had another issue tonigh...
[00:15:31] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:16:50] (Merged) jenkins-bot: Add a link: Show article extract instead of description in the link inspector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708848 (https://phabricator.wikimedia.org/T287636) (owner: Gergő Tisza)
[00:17:16] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) p:Medium→High
[00:30:42] doesn't seem to be working quite as expected, I'll revert.
[00:31:12] (wasn't deployed beyond mwdebug)
[00:34:23] (PS1) Gergő Tisza: Revert "Add a link: Show article extract instead of description in the link inspector" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708849
[00:35:19] (CR) Gergő Tisza: [C: +2] Revert "Add a link: Show article extract instead of description in the link inspector" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708849 (owner: Gergő Tisza)
[00:44:59] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond)
[00:45:39] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond)
[00:56:01] (Merged) jenkins-bot: Revert "Add a link: Show article extract instead of description in the link inspector" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708849 (owner: Gergő Tisza)
[01:15:15] (done)
[01:18:02] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (Sunny00217) Open→Invalid Can't reappear problem again.
[02:11:48] (CR) Legoktm: "This seems reasonable to me, but could you file a ticket in #SRE for this?" [dns] - https://gerrit.wikimedia.org/r/708874 (owner: Thcipriani)
[02:14:01] (CR) Legoktm: "Actually I take the reasonable part back sorry, this introduces a manual step as part of the DC switchover process. That's not a hard bloc" [dns] - https://gerrit.wikimedia.org/r/708874 (owner: Thcipriani)
[02:17:45] (PS1) Andrew Bogott: Partially revert "cloud-vps cloud-init: more tweaks to try to get a perfectly clean run" [puppet] - https://gerrit.wikimedia.org/r/708879 (https://phabricator.wikimedia.org/T287309)
[02:18:23] (CR) jerkins-bot: [V: -1] Partially revert "cloud-vps cloud-init: more tweaks to try to get a perfectly clean run" [puppet] - https://gerrit.wikimedia.org/r/708879 (https://phabricator.wikimedia.org/T287309) (owner: Andrew Bogott)
[02:19:38] (PS2) Andrew Bogott: Partially revert "cloud-vps cloud-init: more tweaks" [puppet] - https://gerrit.wikimedia.org/r/708879 (https://phabricator.wikimedia.org/T287309)
[02:20:42] (CR) Andrew Bogott: [C: +2] Partially revert "cloud-vps cloud-init: more tweaks" [puppet] - https://gerrit.wikimedia.org/r/708879 (https://phabricator.wikimedia.org/T287309) (owner: Andrew Bogott)
[02:46:35] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:13] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:25] SRE, ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (wkandek) Papaul, Effie is on vacation for 2 more weeks. Just FYI, so don't expect any action here soon.
[03:40:05] (PS1) Tim Starling: PNGMetadataExtractor: skip oversize chunks instead of aborting [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708850 (https://phabricator.wikimedia.org/T286273)
[03:40:33] (CR) Tim Starling: [C: +2] PNGMetadataExtractor: skip oversize chunks instead of aborting [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708850 (https://phabricator.wikimedia.org/T286273) (owner: Tim Starling)
[03:59:55] (Merged) jenkins-bot: PNGMetadataExtractor: skip oversize chunks instead of aborting [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708850 (https://phabricator.wikimedia.org/T286273) (owner: Tim Starling)
[04:37:01] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16928 and previous config saved to /var/cache/conftool/dbconfig/20210730-045520-root.json
[04:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:05] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.16/includes/media/PNGMetadataExtractor.php: fix broken PNG thumbnails T286273 (duration: 00m 57s)
[04:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:11] T286273: Image size is not determined for new PNG files with (partially) corrupt metadata - https://phabricator.wikimedia.org/T286273
[05:04:14] (PS1) Marostegui: Revert "db2104: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/708857
[05:05:34] (CR) Marostegui: [C: +2] Revert "db2104: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/708857 (owner: Marostegui)
[05:10:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16929 and previous config saved to /var/cache/conftool/dbconfig/20210730-051024-root.json
[05:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:47] (CR) DannyS712: [C: +1] "Core patch has merged, suggest proceeding with this" [mediawiki-config] - https://gerrit.wikimedia.org/r/701467 (owner: Tim Starling)
[05:25:19] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.16/tests/phpunit/includes/media/PNGMetadataExtractorTest.php: fix broken PNG thumbnails T286273 (duration: 00m 57s)
[05:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:26] T286273: Image size is not determined for new PNG files with (partially) corrupt metadata - https://phabricator.wikimedia.org/T286273
[05:25:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 15%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16930 and previous config saved to /var/cache/conftool/dbconfig/20210730-052527-root.json
[05:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16931 and previous config saved to /var/cache/conftool/dbconfig/20210730-054031-root.json
[05:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16932 and previous config saved to /var/cache/conftool/dbconfig/20210730-055537-root.json
[05:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16933 and previous config saved to /var/cache/conftool/dbconfig/20210730-061041-root.json
[06:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:33] (CR) Giuseppe Lavagetto: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: Giuseppe Lavagetto)
[06:20:21] (CR) Giuseppe Lavagetto: "Merging, then I'll fix the permissions on the docker registries to include the deployment servers." [puppet] - https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: Giuseppe Lavagetto)
[06:20:26] (CR) Giuseppe Lavagetto: [C: +2] profile::kubernetes::deployment_server: add automation for mw on k8s [puppet] - https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: Giuseppe Lavagetto)
[06:25:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: After upgrae', diff saved to https://phabricator.wikimedia.org/P16934 and previous config saved to /var/cache/conftool/dbconfig/20210730-062545-root.json
[06:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:22] (PS1) Giuseppe Lavagetto: hiera: centralize docker::registry variable [puppet] - https://gerrit.wikimedia.org/r/708960
[06:58:27] (PS1) Jcrespo: bacula: Add jobid propery to output of command to list job executions [puppet] - https://gerrit.wikimedia.org/r/708963
[06:59:48] (PS2) Jcrespo: bacula: Add jobid property to output of command to list job executions [puppet] - https://gerrit.wikimedia.org/r/708963
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210730T0700)
[07:05:35] (PS3) Jcrespo: bacula: Add jobid property to output of command to list job executions [puppet] - https://gerrit.wikimedia.org/r/708963
[07:10:38] (PS4) Muehlenhoff: Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456)
[07:11:08] (CR) jerkins-bot: [V: -1] Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:12:01] (CR) Jcrespo: [C: +2] bacula: Add jobid property to output of command to list job executions [puppet] - https://gerrit.wikimedia.org/r/708963 (owner: Jcrespo)
[07:12:36] (PS5) Muehlenhoff: Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456)
[07:13:39] (PS1) Jcrespo: bacula: Add jobid property to output of command to list job executions [puppet] - https://gerrit.wikimedia.org/r/708967
[07:13:55] (PS2) Jcrespo: bacula: Add jobid property to output of command to list job executions [puppet] - https://gerrit.wikimedia.org/r/708967
[07:15:03] (PS2) Muehlenhoff: ganeti: Add ganeti test cluster to locations [software/spicerack] - https://gerrit.wikimedia.org/r/708763 (https://phabricator.wikimedia.org/T286206)
[07:15:15] (PS3) Jcrespo: bacula: Fix typo on command line help for check_bacula.py [puppet] - https://gerrit.wikimedia.org/r/708967
[07:15:32] (CR) Jcrespo: [V: +2 C: +2] bacula: Fix typo on command line help for check_bacula.py [puppet] - https://gerrit.wikimedia.org/r/708967 (owner: Jcrespo)
[07:20:49] (PS1) Muehlenhoff: profile::tlsproxy::instance: Default to nginx-light [puppet] - https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456)
[07:24:36] (PS1) Marostegui: dbproxy1012,dbproxy1014: Replace db1125 with db1117:3321 [puppet] - https://gerrit.wikimedia.org/r/708970 (https://phabricator.wikimedia.org/T286329)
[07:25:42] (PS3) Muehlenhoff: ganeti: Add ganeti test cluster to locations [software/spicerack] - https://gerrit.wikimedia.org/r/708763 (https://phabricator.wikimedia.org/T286206)
[07:26:10] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:29:27] (PS2) Giuseppe Lavagetto: hiera: centralize docker::registry variable [puppet] - https://gerrit.wikimedia.org/r/708960
[07:29:29] (PS1) Giuseppe Lavagetto: docker_registry_ha: require authentication from deployment servers [puppet] - https://gerrit.wikimedia.org/r/708971
[07:32:13] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30438/console" [puppet] - https://gerrit.wikimedia.org/r/708960 (owner: Giuseppe Lavagetto)
[07:32:51] (CR) Giuseppe Lavagetto: [V: +1 C: +2] hiera: centralize docker::registry variable [puppet] - https://gerrit.wikimedia.org/r/708960 (owner: Giuseppe Lavagetto)
[07:34:00] (PS1) Jcrespo: bacula: Increase ES backups retention to 90 days [puppet] - https://gerrit.wikimedia.org/r/708972 (https://phabricator.wikimedia.org/T282249)
[07:37:46] (CR) Jcrespo: [C: +1] "Conf ok (checked hostnames, ips and sections), nit on description: "failvoer"" [puppet] - https://gerrit.wikimedia.org/r/708970 (https://phabricator.wikimedia.org/T286329) (owner: Marostegui)
[07:38:33] (PS2) Marostegui: dbproxy1012,dbproxy1014: Replace db1125 with db1117:3321 [puppet] - https://gerrit.wikimedia.org/r/708970 (https://phabricator.wikimedia.org/T286329)
[07:40:05] (CR) Marostegui: [C: +2] dbproxy1012,dbproxy1014: Replace db1125 with db1117:3321 [puppet] - https://gerrit.wikimedia.org/r/708970 (https://phabricator.wikimedia.org/T286329) (owner: Marostegui)
[07:42:30] (PS1) Marostegui: Revert "dbproxy1013,dbproxy1015: Add db1124 as failover" [puppet] - https://gerrit.wikimedia.org/r/708865
[07:44:09] (CR) Marostegui: [C: +2] Revert "dbproxy1013,dbproxy1015: Add db1124 as failover" [puppet] - https://gerrit.wikimedia.org/r/708865 (owner: Marostegui)
[07:47:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:49:31] (CR) Jcrespo: [C: +2] bacula: Increase ES backups retention to 90 days [puppet] - https://gerrit.wikimedia.org/r/708972 (https://phabricator.wikimedia.org/T282249) (owner: Jcrespo)
[07:53:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:54:24] (PS1) Muehlenhoff: Update Cumin aliases for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/708973 (https://phabricator.wikimedia.org/T286206)
[07:55:06] (PS2) Giuseppe Lavagetto: docker_registry_ha: require authentication from deployment servers [puppet] - https://gerrit.wikimedia.org/r/708971
[07:55:57] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30440/console" [puppet] - https://gerrit.wikimedia.org/r/708971 (owner: Giuseppe Lavagetto)
[07:56:33] (CR) Muehlenhoff: [C: +2] Update Cumin aliases for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/708973 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[07:57:08] (CR) Giuseppe Lavagetto: [V: +1 C: +2] "pcc looks good" [puppet] - https://gerrit.wikimedia.org/r/708971 (owner: Giuseppe Lavagetto)
[07:57:40] (PS1) DCausse: flink-session-cluster: Move image name and version under main_app [deployment-charts] - https://gerrit.wikimedia.org/r/708974 (https://phabricator.wikimedia.org/T287374)
[07:57:42] (PS1) DCausse: rdf-streaming-updater: Cleanup image tags under docker [deployment-charts] - https://gerrit.wikimedia.org/r/708975 (https://phabricator.wikimedia.org/T287374)
[07:58:54] <_joe_> moritzm: can I merge your change?
[07:59:09] <_joe_> I would assume yes
[07:59:15] (PS2) Filippo Giunchedi: pontoon: wait for puppetdb to be up before enabling it [puppet] - https://gerrit.wikimedia.org/r/708033
[07:59:22] <_joe_> {{done}}
[07:59:24] (CR) Filippo Giunchedi: pontoon: wait for puppetdb to be up before enabling it (2 comments) [puppet] - https://gerrit.wikimedia.org/r/708033 (owner: Filippo Giunchedi)
[08:02:51] (PS1) Muehlenhoff: addnode cookbook: Also allow ganeti test cluster role [cookbooks] - https://gerrit.wikimedia.org/r/708976 (https://phabricator.wikimedia.org/T286206)
[08:08:09] sorry, please go ahead
[08:13:05] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:14:07] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney) Open→Resolved
[08:22:10] (PS1) Jcrespo: bacula: Adjust number of max volumes for content database backups [puppet] - https://gerrit.wikimedia.org/r/708977 (https://phabricator.wikimedia.org/T282249)
[08:23:47] (PS3) Filippo Giunchedi: pontoon: wait for puppetdb to be up before enabling it [puppet] - https://gerrit.wikimedia.org/r/708033
[08:24:33] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (MoritzMuehlenhoff) Until the root cause has been found and fixed within the autovacuum logic; how about we add a reconciliation systemd timer wh...
[08:36:12] (PS1) Marostegui: Revert "db1124, db1125: Enable notifications." [puppet] - https://gerrit.wikimedia.org/r/708986
[08:37:12] (CR) Marostegui: [C: +2] Revert "db1124, db1125: Enable notifications." [puppet] - https://gerrit.wikimedia.org/r/708986 (owner: Marostegui)
[08:37:22] (PS1) Giuseppe Lavagetto: docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979
[08:38:17] (CR) Cathal Mooney: [C: +2] Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config add [homer/public] - https://gerrit.wikimedia.org/r/708784 (https://phabricator.wikimedia.org/T284592) (owner: Cathal Mooney)
[08:39:02] (Merged) jenkins-bot: Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032 [homer/public] - https://gerrit.wikimedia.org/r/708784 (https://phabricator.wikimedia.org/T284592) (owner: Cathal Mooney)
[08:45:33] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30441/console" [puppet] - https://gerrit.wikimedia.org/r/708979 (owner: Giuseppe Lavagetto)
[08:53:09] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[08:53:48] (PS2) Giuseppe Lavagetto: docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979
[08:54:18] (CR) jerkins-bot: [V: -1] docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979 (owner: Giuseppe Lavagetto)
[08:55:59] (PS3) Giuseppe Lavagetto: docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979
[08:56:08] !log running homer against asw2-a-eqiad and asw2-b-eqiad to bring homer in line with manual config added for buffer mem. T284592
[08:56:11] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[08:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:16] T284592: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592
[08:57:48] <_joe_> topranks: it has nothing to do with your change, but got a peak of NEL reports for timeouts in vietnam
[08:58:01] <_joe_> maybe something not right with paths to eqsin I'd imagine
[08:58:24] yeah was just looking at them
[08:58:32] (CR) Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[08:58:35] <_joe_> oh all one ISP specifically
[08:58:35] alert has cleared but i'll dig in and see if i can see any pattern
[08:58:39] <_joe_> topranks: <3
[08:58:47] one step ahead of me :)
[08:58:48] thanks
[09:01:57] (CR) Btullis: [C: +2] Update TLS configuration for analytics-test-presto [puppet] - https://gerrit.wikimedia.org/r/708739 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[09:05:41] (PS4) Giuseppe Lavagetto: docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979
[09:29:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:30:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:32:09] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@c6cfa85]: Add non-public source to render tegola MVT in maps2007
[09:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:31] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@c6cfa85]: Add non-public source to render tegola MVT in maps2007 (duration: 00m 21s)
[09:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:55] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@289d3a9]: Add public source to render tegola MVT in maps2007 temporarily
[09:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:17] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@289d3a9]: Add public source to render tegola MVT in maps2007 temporarily (duration: 00m 21s)
[09:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:17] (PS5) Giuseppe Lavagetto: docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979
[09:46:43] (CR) Jelto: [V: +1 C: +2] icinga::monitor::gitlab add alerts for https and ssh for gitlab [puppet] - https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:47:06] (PS4) Jelto: icinga::monitor::gitlab add alerts for https and ssh for gitlab [puppet] - https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170)
[09:48:20] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30445/console" [puppet] - https://gerrit.wikimedia.org/r/708979 (owner: Giuseppe Lavagetto)
[09:50:01] (PS2) Jcrespo: bacula: Adjust number of max volumes for database backups [puppet] - https://gerrit.wikimedia.org/r/708977 (https://phabricator.wikimedia.org/T282249)
[09:50:03] (PS1) Jcrespo: bacula: Uniformize the backup of databases to be full weekly [puppet] - https://gerrit.wikimedia.org/r/709006
[09:56:19] (PS6) Giuseppe Lavagetto: docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979
[09:57:00] (PS3) Jcrespo: bacula: Adjust number of max volumes for database backups [puppet] - https://gerrit.wikimedia.org/r/708977 (https://phabricator.wikimedia.org/T282249)
[10:02:32] (PS4) Jcrespo: bacula: Adjust number of max volumes for database backups [puppet] - https://gerrit.wikimedia.org/r/708977 (https://phabricator.wikimedia.org/T282249)
[10:03:00] (PS5) Jcrespo: bacula: Adjust number of max volumes for database backups [puppet] - https://gerrit.wikimedia.org/r/708977 (https://phabricator.wikimedia.org/T282249)
[10:04:30] (PS2) Jcrespo: bacula: Uniformize the backup of databases to be full weekly [puppet] - https://gerrit.wikimedia.org/r/709006
[10:04:53] (CR) Jcrespo: [C: +2] bacula: Adjust number of max volumes for database backups [puppet] - https://gerrit.wikimedia.org/r/708977 (https://phabricator.wikimedia.org/T282249) (owner: Jcrespo)
[10:06:23] (CR) Jcrespo: [C: +2] bacula: Uniformize the backup of databases to be full weekly [puppet] - https://gerrit.wikimedia.org/r/709006 (owner: Jcrespo)
[10:14:40] PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:14:49] ^that's me, fixing
[10:15:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:17:37] (CR) Giuseppe Lavagetto: [C: +2] docker: introduce docker::credentials [puppet] - https://gerrit.wikimedia.org/r/708979 (owner: Giuseppe Lavagetto)
[10:23:40] (PS1) Filippo Giunchedi: hieradata: add 'role' for prometheus service [puppet] - https://gerrit.wikimedia.org/r/709010
[10:23:42] (PS1) Filippo Giunchedi: hieradata: easier navigation for service::catalog [puppet] - https://gerrit.wikimedia.org/r/709011
[10:23:48] (CR) Jgiannelos: [C: +2] [push-notifications] Hygiene: Remove invalid TODO [deployment-charts] - https://gerrit.wikimedia.org/r/708798 (owner: Mholloway)
[10:25:02] (PS1) Jcrespo: bacula: Fix pool name on eqiad databases [puppet] - https://gerrit.wikimedia.org/r/709012
[10:26:26] (Merged) jenkins-bot: [push-notifications] Hygiene: Remove invalid TODO [deployment-charts] - https://gerrit.wikimedia.org/r/708798 (owner: Mholloway)
[10:26:30] (CR) Jcrespo: [C: +2] bacula: Fix pool name on eqiad databases [puppet] - https://gerrit.wikimedia.org/r/709012 (owner: Jcrespo)
[10:33:06] RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:34:05] (CR) Mark Bergsma: "I agree that we should a) disable DPL on wikis where it's currently unused, and b) ensure it's not enabled on any more wikis (so as to not" [mediawiki-config] - https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: Legoktm)
[10:37:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:39:29] (PS1) Elukey: Add ml-serve-{eqiad,codfw} to kubernetes_clusters [puppet] - https://gerrit.wikimedia.org/r/709014 (https://phabricator.wikimedia.org/T272919)
[10:47:05] (PS1) Marostegui: Revert "mariadb: Move db1124 to m2." [puppet] - https://gerrit.wikimedia.org/r/708988
[10:47:10] (PS2) Marostegui: Revert "mariadb: Move db1124 to m2." [puppet] - https://gerrit.wikimedia.org/r/708988
[10:48:58] (CR) Marostegui: [C: +2] Revert "mariadb: Move db1124 to m2." [puppet] - https://gerrit.wikimedia.org/r/708988 (owner: Marostegui)
[10:52:51] (CR) Cathal Mooney: [C: +2] O:alerting_host: create puppet class for statograph service. (4 comments) [puppet] - https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: Cathal Mooney)
[11:01:16] (PS1) Muehlenhoff: Remove access for gsingers [puppet] - https://gerrit.wikimedia.org/r/709017
[11:01:45] (CR) jerkins-bot: [V: -1] Remove access for gsingers [puppet] - https://gerrit.wikimedia.org/r/709017 (owner: Muehlenhoff)
[11:02:22] (PS1) Muehlenhoff: Fix Cumin alias [puppet] - https://gerrit.wikimedia.org/r/709018
[11:04:23] (CR) Muehlenhoff: [C: +2] Fix Cumin alias [puppet] - https://gerrit.wikimedia.org/r/709018 (owner: Muehlenhoff)
[11:05:37] (PS2) Muehlenhoff: Remove access for gsingers [puppet] - https://gerrit.wikimedia.org/r/709017
[11:21:21] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/709011 (owner: Filippo Giunchedi)
[11:23:09] !log installing libsndfile security updates on stretch
[11:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:18] (PS2) Jbond: admin README - convert to markdown and clarify system user/group docs [puppet] - https://gerrit.wikimedia.org/r/708777 (owner: Ottomata)
[11:29:34] (CR) jerkins-bot: [V: -1] admin README - convert to markdown and clarify system user/group docs [puppet] - https://gerrit.wikimedia.org/r/708777 (owner: Ottomata)
[11:45:24] (PS1) Cathal Mooney: O:alerting_host: fix job command for statograph systemd timer [puppet] - https://gerrit.wikimedia.org/r/709023 (https://phabricator.wikimedia.org/T285569)
[11:48:22] (CR) Jbond: [C: +1] O:alerting_host: fix job command for statograph systemd timer [puppet] - https://gerrit.wikimedia.org/r/709023 (https://phabricator.wikimedia.org/T285569) (owner: Cathal Mooney)
[11:51:58] (CR) Cathal Mooney: [C: +2] O:alerting_host: fix job command for statograph systemd timer [puppet] - https://gerrit.wikimedia.org/r/709023 (https://phabricator.wikimedia.org/T285569) (owner: Cathal Mooney)
[11:54:29] (CR) Jbond: "> Patch Set 1:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/708777 (owner: Ottomata)
[12:03:23] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) >>! In T263578#7248089, @MoritzMuehlenhoff wrote: > Until the root cause has been found and fixed within the autovacuum logic; how about...
[12:07:53] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (MoritzMuehlenhoff) >>! In T263578#7248437, @jbond wrote: > As such I wonder if we should be more aggressive then that. under normal operations...
[12:09:48] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) >>! In T263578#7248442, @MoritzMuehlenhoff wrote: >>>! In T263578#7248437, @jbond wrote: >> As such I wonder if we should be more aggress...
[12:10:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/709017 (owner: 10Muehlenhoff) [12:28:44] (03PS1) 10KartikMistry: Update cxserver to [deployment-charts] - 10https://gerrit.wikimedia.org/r/709025 (https://phabricator.wikimedia.org/T286473) [12:28:53] 10SRE, 10Infrastructure-Foundations: Blacklist FUSE - https://phabricator.wikimedia.org/T287753 (10MoritzMuehlenhoff) [12:29:30] 10SRE, 10Infrastructure-Foundations: Block FUSE (kernel module/package) on hosts which don't need it - https://phabricator.wikimedia.org/T287753 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:32:46] (03PS2) 10KartikMistry: Update cxserver to 2021-07-30-121251-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/709025 (https://phabricator.wikimedia.org/T286473) [12:33:57] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709027 (https://phabricator.wikimedia.org/T286765) (owner: 10Awight) [12:36:53] (03PS1) 10Cathal Mooney: Removed verify=False satement from requests session constructor that had been present during initial testing. 
[software/statograph] - 10https://gerrit.wikimedia.org/r/709028 [12:47:19] (03PS1) 10Filippo Giunchedi: thanos: add query-url to rule [puppet] - 10https://gerrit.wikimedia.org/r/709029 (https://phabricator.wikimedia.org/T287142) [12:47:21] (03PS1) 10Filippo Giunchedi: alertmanager: allow alerts from grafana and thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/709030 (https://phabricator.wikimedia.org/T287142) [12:47:23] (03PS1) 10Filippo Giunchedi: pontoon: allow grafana and thanos to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/709031 (https://phabricator.wikimedia.org/T287142) [12:47:25] (03PS1) 10Filippo Giunchedi: prometheus: tweak external url to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) [12:48:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30449/console" [puppet] - 10https://gerrit.wikimedia.org/r/709014 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [12:50:02] (03PS2) 10Filippo Giunchedi: alertmanager: allow alerts from grafana and thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/709030 (https://phabricator.wikimedia.org/T287142) [12:50:04] (03PS2) 10Filippo Giunchedi: pontoon: allow grafana and thanos to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/709031 (https://phabricator.wikimedia.org/T287142) [12:50:06] (03PS2) 10Filippo Giunchedi: prometheus: tweak external url to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) [12:52:21] (03CR) 10Mepps: "The code change looks perfect. Just needs rebasing and cleaning up the commit message to one change id." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708832 (https://phabricator.wikimedia.org/T287511) (owner: 10Eigyan) [12:53:00] (03PS1) 10Jbond: P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) [12:53:45] (03CR) 10jerkins-bot: [V: 04-1] P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:58:41] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/709030 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [12:58:50] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [13:01:37] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add query-url to rule [puppet] - 10https://gerrit.wikimedia.org/r/709029 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [13:02:49] (03PS2) 10Jbond: P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) [13:03:04] (03CR) 10Filippo Giunchedi: "Note this will restart Prometheus, I'll do a staggered rollout next week" [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [13:03:27] (03CR) 10jerkins-bot: [V: 04-1] P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:03:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30452/console" [puppet] - 10https://gerrit.wikimedia.org/r/709034 
(https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:07:22] (03CR) 10Muehlenhoff: P:puppetdb: add script to automatically clean down the stockpile dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:09:36] (03CR) 10Elukey: [V: 03+1 C: 03+2] "Pcc looks good, basically only global configs are rendered." [puppet] - 10https://gerrit.wikimedia.org/r/709014 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:09:44] (03PS2) 10Elukey: Add ml-serve-{eqiad,codfw} to kubernetes_clusters [puppet] - 10https://gerrit.wikimedia.org/r/709014 (https://phabricator.wikimedia.org/T272919) [13:10:13] (03PS1) 10Giuseppe Lavagetto: Bugfix: accept full schema in docker configuration [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/709036 [13:11:19] (03PS3) 10Jbond: P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) [13:12:27] (03PS4) 10Jbond: P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) [13:12:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Bugfix: accept full schema in docker configuration [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/709036 (owner: 10Giuseppe Lavagetto) [13:12:54] (03CR) 10Jbond: P:puppetdb: add script to automatically clean down the stockpile dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:14:48] (03Merged) 10jenkins-bot: Bugfix: accept full schema in docker configuration [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/709036 (owner: 10Giuseppe Lavagetto) [13:16:42] (03PS2) 10Eigyan: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) [13:17:32] (03PS1) 10Giuseppe Lavagetto: Relase new version 0.0.13-1 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/709037 [13:17:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/statograph] - 10https://gerrit.wikimedia.org/r/709028 (owner: 10Cathal Mooney) [13:18:11] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Relase new version 0.0.13-1 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/709037 (owner: 10Giuseppe Lavagetto) [13:19:02] (03PS3) 10Eigyan: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) [13:22:12] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1439.eqiad.wmnet [13:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:26] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1440.eqiad.wmnet [13:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:03] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw144[5-6].eqiad.wmnet [13:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:22] (03PS1) 10Btullis: Fix presto services in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/709038 (https://phabricator.wikimedia.org/T273642) [13:24:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:26:08] !log uploaded docker-report 0.0.13 to buster [13:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:21] (03CR) 10Elukey: [C: 03+1] Fix presto services in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/709038 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:33:01] (03CR) 10Btullis: [C: 03+2] Fix presto services in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/709038 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:33:30] (03CR) 10Jbond: [C: 03+2] P:puppetdb: add script to automatically clean down the stockpile dir [puppet] - 10https://gerrit.wikimedia.org/r/709034 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:36:58] (03PS1) 10Dzahn: site/conftool: convert 4 appservers to jobrunners in row D for balance [puppet] - 10https://gerrit.wikimedia.org/r/709041 (https://phabricator.wikimedia.org/T279309) [13:41:39] (03PS1) 10Giuseppe Lavagetto: deploy-mwdebug: several bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/709042 [13:41:53] (03PS2) 10Giuseppe Lavagetto: deploy-mwdebug: several bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/709042 [13:42:07] (03PS2) 10Dzahn: site/conftool: convert 4 appservers to jobrunners in row D for balance [puppet] - 10https://gerrit.wikimedia.org/r/709041 (https://phabricator.wikimedia.org/T279309) [13:44:55] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [13:46:15] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [13:49:03] (03PS1) 10Ottomata: airflow::instance - use force => true when ensuring directories [puppet] - 
10https://gerrit.wikimedia.org/r/709044 (https://phabricator.wikimedia.org/T284172) [13:49:28] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/709041 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [13:50:42] (03PS1) 10Btullis: Fix the discovery URL for presto clients in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/709045 (https://phabricator.wikimedia.org/T273642) [13:51:36] (03CR) 10Dzahn: [C: 03+2] site/conftool: convert 4 appservers to jobrunners in row D for balance [puppet] - 10https://gerrit.wikimedia.org/r/709041 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [13:51:59] (03CR) 10Ottomata: admin README - convert to markdown and clarify system user/group docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [13:52:01] (03CR) 10Elukey: [C: 03+1] Fix the discovery URL for presto clients in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/709045 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:52:20] (03CR) 10Btullis: [C: 03+2] Fix the discovery URL for presto clients in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/709045 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:52:37] PROBLEM - mediawiki-installation DSH group on mw1445 is CRITICAL: Host mw1445 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:53:05] me, fixing downtimes [13:53:20] (03PS2) 10Ottomata: airflow::instance - use force => true when ensuring directories [puppet] - 10https://gerrit.wikimedia.org/r/709044 (https://phabricator.wikimedia.org/T284172) [13:53:28] (03CR) 10Jbond: admin README - convert to markdown and clarify system user/group docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [13:53:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw[1445-1446].eqiad.wmnet with reason: reimage [13:53:31] 
!log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw[1445-1446].eqiad.wmnet with reason: reimage [13:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw[1439-1440].eqiad.wmnet with reason: reimage [13:53:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw[1439-1440].eqiad.wmnet with reason: reimage [13:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:14] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30454/console" [puppet] - 10https://gerrit.wikimedia.org/r/709044 (https://phabricator.wikimedia.org/T284172) (owner: 10Ottomata) [13:55:49] (03CR) 10Ottomata: [V: 03+1 C: 03+2] airflow::instance - use force => true when ensuring directories [puppet] - 10https://gerrit.wikimedia.org/r/709044 (https://phabricator.wikimedia.org/T284172) (owner: 10Ottomata) [13:56:52] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1439.eqiad.wmnet', 'mw1440.eqiad.wmnet', 'mw1445.eqiad.... 
[13:57:13] !log mw1439,mw1440,mw1445,mw1446 - converting from app/API to jobrunners - reimaging for row balance in eqiad [13:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:36] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) @wkandek thank you for the heads up [14:00:44] (03PS2) 10David Caro: prometheus.icinga-exporter-am: support --labels.team.config-file [puppet] - 10https://gerrit.wikimedia.org/r/708521 [14:00:48] (03CR) 10David Caro: prometheus.icinga-exporter-am: support --labels.team.config-file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708521 (owner: 10David Caro) [14:00:52] (03PS1) 10David Caro: profile.icinga_exporter: Add label_teams_config_file param [puppet] - 10https://gerrit.wikimedia.org/r/709052 [14:00:56] (03PS1) 10David Caro: prometheus.icinga_exporter: Add label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709053 [14:01:00] (03PS1) 10David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709054 [14:01:42] (03PS2) 10David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709054 [14:01:49] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/709054 (owner: 10David Caro) [14:12:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1439.eqiad.wmnet with reason: REIMAGE [14:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1440.eqiad.wmnet with reason: REIMAGE [14:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1439.eqiad.wmnet with reason: REIMAGE [14:15:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1445.eqiad.wmnet with reason: REIMAGE [14:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1440.eqiad.wmnet with reason: REIMAGE [14:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:57] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1446.eqiad.wmnet with reason: REIMAGE [14:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1445.eqiad.wmnet with reason: REIMAGE [14:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1446.eqiad.wmnet with reason: REIMAGE [14:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:07] (03PS1) 10Jdlrobson: Styling fixes for mobile visual editor (and editor loading overlay) [extensions/MobileFrontend] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708990 (https://phabricator.wikimedia.org/T287528) [14:33:45] (03CR) 10Herron: [C: 03+1] alertmanager: allow alerts from grafana and thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/709030 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:34:29] (03CR) 10Herron: [C: 03+1] hieradata: add 'role' for prometheus service [puppet] - 10https://gerrit.wikimedia.org/r/709010 (owner: 10Filippo Giunchedi) [14:39:48] !log Setting up BGP peering to Xiber 
LLC AS393950 on cr2-eqord, Equinix Chicago exchange. [14:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:55] (03PS2) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [14:40:19] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1439.eqiad.wmnet', 'mw1440.eqiad.wmnet', 'mw1445.eqiad.wmnet', 'mw1446.eqiad.wmnet'] ` and were **ALL**... [14:41:18] (03CR) 10jerkins-bot: [V: 04-1] os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [14:42:22] (03PS1) 10Dzahn: site: remove already installed servers from insetup regex [puppet] - 10https://gerrit.wikimedia.org/r/709064 (https://phabricator.wikimedia.org/T279309) [14:44:36] 10ops-codfw: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) [14:44:50] 10ops-codfw: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) p:05Triage→03Medium [14:46:05] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1439.eqiad.wmnet [14:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:11] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1440.eqiad.wmnet [14:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:20] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw144[5-6].eqiad.wmnet [14:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:58] 10SRE, 10ops-codfw: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) p:05Medium→03Low [14:50:18] (03CR) 10Herron: [C: 03+1] prometheus: tweak external url to reflect reality [puppet] - 
10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [14:52:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1439.eqiad.wmnet [14:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1440.eqiad.wmnet [14:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw144[5-6].eqiad.wmnet [14:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:51] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/30455/mw1426.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/709064 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:57:02] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for Kerberos - https://phabricator.wikimedia.org/T287763 (10MoritzMuehlenhoff) [14:57:26] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for Kerberos - https://phabricator.wikimedia.org/T287763 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [14:58:37] !log mw1439, mw1440, mw1445, mw1446 - scap pull, repool as jobrunners after reimaging [14:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:38] (03PS1) 10Majavah: metricsinfra: add karma with cas [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) [15:04:38] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:46] (03PS1) 10Dzahn: site/conftool: decom jobrunners: mw1295,mw1296,mw1298,mw1299 [puppet] - 10https://gerrit.wikimedia.org/r/709068 (https://phabricator.wikimedia.org/T280203) [15:07:19] !log 
dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1295-1296].eqiad.wmnet [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:48] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 976 mismatched wikiversions daniel_zahn T286463 https://wikitech.wikimedia.org/wiki/Application_servers [15:10:48] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2383 is CRITICAL: Host mw2383 is not in mediawiki-installation dsh group daniel_zahn T286463 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:12:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deploy-mwdebug: several bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/709042 (owner: 10Giuseppe Lavagetto) [15:17:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1295-1296].eqiad.wmnet [15:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:42] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1295-1296].eqiad.wmn... [15:19:14] (03CR) 10Mepps: [C: 03+1] "Looks great to me but I don't have +2 permissions on this repo." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: 10Eigyan) [15:19:29] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw129[5-6].eqiad.wmnet [15:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:47] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw129[8-9].eqiad.wmnet [15:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1298-1299].eqiad.wmnet [15:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:55] (03CR) 10Ppchelko: [C: 03+1] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: 10Eigyan) [15:25:03] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) p:05Triage→03High [15:25:19] (03PS1) 10Giuseppe Lavagetto: mwdebug: also source /etc/helmfile-defaults/mediawiki/releases.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/709069 (https://phabricator.wikimedia.org/T287570) [15:29:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: also source /etc/helmfile-defaults/mediawiki/releases.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/709069 (https://phabricator.wikimedia.org/T287570) (owner: 10Giuseppe Lavagetto) [15:30:48] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:57] (03CR) 10Bstorm: [C: 03+2] cloud dns: tidy up the labs-ip-alias-dump script [puppet] - 10https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) (owner: 10Bstorm) [15:33:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1298-1299].eqiad.wmnet [15:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:19] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1298-1299].eqiad.wmn... [15:44:43] (03CR) 10Dzahn: [C: 03+2] site/conftool: decom jobrunners: mw1295,mw1296,mw1298,mw1299 [puppet] - 10https://gerrit.wikimedia.org/r/709068 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [15:46:07] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [15:49:16] (03PS1) 10Giuseppe Lavagetto: deploy-mwdebug: distinguish between mw and web tag [puppet] - 10https://gerrit.wikimedia.org/r/709075 [15:49:42] (03PS2) 10Giuseppe Lavagetto: deploy-mwdebug: distinguish between mw and web tag [puppet] - 10https://gerrit.wikimedia.org/r/709075 [15:50:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deploy-mwdebug: distinguish between mw and web tag [puppet] - 10https://gerrit.wikimedia.org/r/709075 (owner: 10Giuseppe Lavagetto) [15:58:04] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:46] (03PS1) 10Giuseppe Lavagetto: deploy-mwdebug: fix the regex [puppet] - 10https://gerrit.wikimedia.org/r/709076 [15:59:14] (03CR) 10jerkins-bot: [V: 04-1] deploy-mwdebug: fix the regex [puppet] - 10https://gerrit.wikimedia.org/r/709076 (owner: 10Giuseppe Lavagetto) [15:59:16] (03CR) 10Thcipriani: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/708874 (owner: 10Thcipriani) [16:00:19] (03PS2) 10Giuseppe Lavagetto: deploy-mwdebug: fix the regex [puppet] - 10https://gerrit.wikimedia.org/r/709076 [16:03:03] (03PS3) 10Giuseppe Lavagetto: deploy-mwdebug: fix the regex [puppet] - 10https://gerrit.wikimedia.org/r/709076 [16:03:42] how many dumb mistakes can one man make in writing a simple script? apparently a lot [16:09:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deploy-mwdebug: fix the regex [puppet] - 10https://gerrit.wikimedia.org/r/709076 (owner: 10Giuseppe Lavagetto) [16:11:00] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:14] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:39] are we continuously deploying containers now? :D [16:16:03] joe: not mistakes, just opportunities for improvement. :) [16:16:40] thcipriani: well now we are, because the first attempts failed because of the mistakes :D [16:16:53] bd808: I don't speak american corporate, sorry :) [16:17:21] * bd808 is sometimes glad that his activity on toolforge.org deploys is not automatically logged for all to see [16:20:02] thcipriani: for today, it's going to be just a script I run manually, by monday, a cronjob, by end of next week, hopefully scap runs it [16:20:19] very exciting! 
[16:20:24] so that scap deploy => k8s deployment [16:20:46] the script is abysmal but does what we need to re-open the demo on all sites [16:21:11] nice, it'll fit right in with the rest of our abysmal scripts :P [16:21:30] tbh most of the few bugs we already found are tied to stuff that we need to add to the images, like static assets [16:21:35] or gitinfo [16:22:00] I'm jazzed to see a very huge chunk of work go out for people to try [16:22:32] when I saw your email it felt like a very sudden leap of progress for mw-on-k8s [16:22:40] there are things we know not to work, for instance, until we move all shellouts to shellbox, they'll fail on mw-on-k8s [16:23:37] but it's all coming together. I hope we'll have a well-benchmarked deployment by the end of quarter, and that we'll have ironed out enough bugs to allow sending a bit of traffic to a non-debug installation [16:24:29] most of the things that I know are broken are being worked on (not sure about the long(?) list of shellouts). Looking forward to finding all the things we don't know are broken :) [16:25:51] sounds very exciting!
^^ [16:30:22] (03PS13) 10Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [16:30:58] (03PS14) 10Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [16:32:18] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:32:37] uff [16:32:40] also exciting ^^ [17:09:35] (03PS15) 10Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [17:23:39] (03CR) 10Hashar: [C: 04-1] "I have reviewed the plugin differences using:" [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [17:50:00] Hello! I am working with the AHT team to get votewiki ready for elections next week and am running into a "RdbmsDBQueryError" when trying to create an election. [17:50:15] Who's the right person/channel to talk to about this? [17:51:28] Niharika: Depends what about it you're wanting to talk about ;) [17:51:40] To fix it, silly. [17:51:42] Have you looked up the error in logstash or on a mwlog host? [17:51:49] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.07.30?id=9Ad1-HoB3UDnvh-QUvxD [17:52:24] something about 'altwiki' I see [17:53:40] that is possibly platform or performance (Aaron) [17:53:54] looks like MW is confused as to where the database is, and is looking on the wrong server/cluster [17:54:29] Are you asking it to do something on altwiki (or all wikis?) on the creation form? [17:55:51] mind pasting the non-PII parts of the stacktrace somewhere us without logstash access can see it too? [17:57:07] Reedy: Nope, we are not. 
But I think it is possible securepoll wants that. one can vote from different wikis. [17:57:17] majavah: Gimme a minute, sorry. [17:57:55] the code is doing... [17:57:56] foreach ( $store->remoteWikis as $dbname ) { [17:57:56] $rdbw = $this->loadBalancer->getConnection( DB_PRIMARY, [], $dbname ); [17:58:44] which is populated via $this->remoteWikis = array_diff( $wikis, [ WikiMap::getCurrentWikiId() ] ); [17:59:37] And I guess the problem is due to it reusing the same loadbalancer [17:59:56] in other contexts (maintenance scripts), we do stuff like [17:59:57] $lb = $lbFactory->getMainLB( $wiki ); [17:59:57] $db = $lb->getConnection( DB_PRIMARY, [], $wiki ); [18:00:18] Hi there [18:00:23] Reedy: yes, this was a global election [18:00:32] so it would have attempted to access all SUL wikis [18:00:37] that's what CA does too https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/36608f6bee24d2cc16f79324bf3635e660d8be3a/includes/CentralAuthUser.php#L2596 [18:01:09] I presume the loadbalancer object itself is bound to the section itself, so you need to ask the factory for a load balancer to the specific wiki you're connecting to [18:01:12] I wonder if this has been broken more recently via other core changes [18:01:22] The fix is potentially simple [18:01:56] or since beta only has one db cluster/section, no-one just ever tested it on a multi-cluster setup after touching the securepoll code [18:03:26] Reedy: I think it was introduced in https://github.com/wikimedia/mediawiki-extensions-SecurePoll/commit/71a5593dc9b4ea4ee28a75fce5381bbd19fc77c4 and not a core bug [18:03:56] Ah yeah [18:04:06] majavah: https://dpaste.org/hOwQ [18:04:08] looks like a complete misunderstanding of the code [18:04:13] https://github.com/wikimedia/mediawiki-extensions-SecurePoll/commit/71a5593dc9b4ea4ee28a75fce5381bbd19fc77c4#diff-c682d89300c58b325fe3999cb9b82ff980dd70b8fb6ad7f64a8afa22f7ffc8edL724 [18:04:36] Niharika: already got the relevant bits from reedy, but thanks [18:05:02] 
sorry, I was juggling a couple things [18:05:09] It looks like other bits of code around there might be also broken [18:06:09] Could revert the whole patch (dunno if it'll do it cleanly) [18:06:28] Or inject a LBFactory, and do a few line reverts [18:07:15] nope, doesn't revert cleanly [18:08:38] https://phabricator.wikimedia.org/T287780 filed [18:08:47] tzatziki: How important this is fixed "now"? [18:09:26] Reedy: We need it by Monday at least. [18:09:32] Election starts on Wednesday. [18:09:55] Ok, so we don't have to rush too much and do a friday/weekend deploy [18:10:01] Yeah, we're hoping to have things totally in place and ready by end of Tuesday so there's not a ton of time [18:10:29] majavah: Are you wanting to make a patch? [18:10:54] sure, I can try [18:11:17] the hacky way is to not inject the services... [18:11:36] the less hacky way, should be to inject the LBFactory service, and then just do a few line reverts where the LBFactory was used [18:12:50] looks like only two lines need re-instating, and the one after adjusting to fit [18:13:11] Is there a task for making the election we can put T287780 as blocking? [18:13:11] T287780: SecurePoll CreatePage can no longer correctly select "remote" wiki databases that aren't in the same cluster - https://phabricator.wikimedia.org/T287780 [18:13:35] There's another issue with Securepoll. The translation system seems to not be happy. https://dpaste.org/wbh3 [18:13:51] thanks for your help Reedy and majavah <3 [18:15:53] Niharika: Want to dump that into another ticket? [18:16:08] can do! [18:17:36] if someone else writes the patch, I'm not then having to self merge :P [18:17:52] Unfortunately we can't easily test this on beta as all the dbs are on the same db server [18:18:34] https://phabricator.wikimedia.org/T287782 [18:20:53] Reedy: entirely untested, but I think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/709089 got all of those in the action pages [18:22:27] majavah: thanks! 
looks good at a quick glance [18:22:36] I need to head out for a bit to get food and stuff, so will have a look properly when I'm back :) [18:22:37] not sure if it's broken outside those classes I checked, not familiar with the extension layout [18:22:55] cool. I might already be off for the night at that point, not sure [18:23:04] looks like it's work done against a few different bugs in the different files [18:23:10] We'll see what CI has to say :) [18:23:48] I ran phan and phpcs on it beforehand :P [18:24:01] heh [18:24:10] * Reedy wonders if Niharika got people to write enough tests [18:24:11] * Reedy grins [18:24:21] XD [18:24:59] Wellll we were mostly testing the stuff we added with the STV. [18:25:03] Didn't expect this. [18:25:45] this isn't really something you could have caught on the beta cluster, as all the databases there are on one cluster [18:26:10] but as I said, I'm not at all familiar with the extension, so there might be similar bugs lurking in places I didn't check [18:26:24] TallyPage and VotePage look like they need updating in a similar way [18:26:52] They use DBLoadBalancer rather than DBLoadBalancerFactory [18:27:25] phuedx: but I don't see them accessing other databases than the votewiki/local one? [18:27:47] the bug is with using the load balancer for the local/votewiki shard for databases not on that shard [18:27:53] Yeah, it should only be if it's doing "other" database queries [18:28:25] majavah: Ah. You're correct. 
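(Annotation, not part of the channel log.) The failure mode discussed above, where a load balancer bound to the local section is reused for wikis that live on another section, can be illustrated with a small toy model. This is only a sketch under stated assumptions: the classes below are hypothetical Python stand-ins, not MediaWiki's real `LoadBalancer` or `LBFactory` API, and the section names `sectionA`/`sectionB` and the placement of `votewiki` and `altwiki` on them are invented for the example.

```python
from dataclasses import dataclass


class WrongSectionError(Exception):
    """Raised when a load balancer is asked for a database it does not serve."""


@dataclass(frozen=True)
class LoadBalancer:
    """Toy per-section load balancer (hypothetical, not the MediaWiki class)."""
    section: str
    databases: frozenset

    def get_connection(self, dbname: str) -> str:
        # A balancer only knows the servers of its own section.
        if dbname not in self.databases:
            raise WrongSectionError(
                f"{dbname} is not hosted on section {self.section}"
            )
        return f"connection to {dbname} via {self.section}"


class LBFactory:
    """Toy factory that hands out the balancer for the section hosting a wiki."""

    def __init__(self, sections: dict):
        self._balancers = {
            name: LoadBalancer(name, frozenset(dbs))
            for name, dbs in sections.items()
        }
        self._section_of = {
            db: name for name, dbs in sections.items() for db in dbs
        }

    def get_main_lb(self, dbname: str) -> LoadBalancer:
        return self._balancers[self._section_of[dbname]]


def connect_reusing_local_lb(local_lb: LoadBalancer, remote_wikis: list) -> list:
    # Buggy pattern: one balancer, bound to the local section, for every wiki.
    return [local_lb.get_connection(db) for db in remote_wikis]


def connect_via_factory(factory: LBFactory, remote_wikis: list) -> list:
    # Fixed pattern: ask the factory for the balancer of each target wiki.
    return [factory.get_main_lb(db).get_connection(db) for db in remote_wikis]
```

In this model, a single-section layout (as described for beta) hides the bug, because the local balancer happens to serve every database; the error only appears once the remote wiki sits on a different section.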
I'd looked at the definitions in ActionPageFactory [18:42:58] (03PS1) 10Ottomata: Use +default in InitialiseSettings-labs.php for event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709090 (https://phabricator.wikimedia.org/T287760) [18:43:38] (03PS1) 10Michael DiPietro: add mdipietro newhire to icinga contact groups [puppet] - 10https://gerrit.wikimedia.org/r/709091 (https://phabricator.wikimedia.org/T287287) [18:44:48] (03CR) 10Ottomata: [C: 03+2] Use +default in InitialiseSettings-labs.php for event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709090 (https://phabricator.wikimedia.org/T287760) (owner: 10Ottomata) [18:46:49] (03CR) 10Bstorm: [C: 03+1] add mdipietro newhire to icinga contact groups [puppet] - 10https://gerrit.wikimedia.org/r/709091 (https://phabricator.wikimedia.org/T287287) (owner: 10Michael DiPietro) [18:47:50] (03CR) 10Andrew Bogott: [C: 03+1] "if mdipietro actually pages you then we might want to add a mdipietro-email alternative; it's fine to merge this and see how noisy things " [puppet] - 10https://gerrit.wikimedia.org/r/709091 (https://phabricator.wikimedia.org/T287287) (owner: 10Michael DiPietro) [18:49:56] (03CR) 10Mholloway: [C: 03+1] "LGTM. I'm not deploying much these days myself, so I'd recommend scheduling this for a backport deploy window (https://wikitech.wikimedia." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan) [18:53:03] (03CR) 10Michael DiPietro: [C: 03+2] add mdipietro newhire to icinga contact groups [puppet] - 10https://gerrit.wikimedia.org/r/709091 (https://phabricator.wikimedia.org/T287287) (owner: 10Michael DiPietro) [18:53:08] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:54:38] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:00:55] mdipietro: gerrit seems to show you've added a trailing new line at the end [19:07:22] (03CR) 10Ottomata: [C: 03+1] "I can deploy this for you on Monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan) [19:07:26] (03PS8) 10Sharvaniharan: Stream config for android_notification_interaction schema Bug: T287652 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) [19:09:33] (03CR) 10Sharvaniharan: "> Patch Set 7:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan) [19:26:17] (03PS1) 10Legoktm: Add tokens and users for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/709097 (https://phabricator.wikimedia.org/T285104) [19:33:28] (03PS1) 10Ottomata: Declare wd_propertysuggester streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) [19:33:44] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/709098 [19:34:37] (03CR) 10jerkins-bot: [V: 04-1] Declare wd_propertysuggester streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) (owner: 10Ottomata) [19:36:40] (03PS2) 10Ottomata: Declare wd_propertysuggester streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) [19:37:50] (03CR) 10jerkins-bot: [V: 04-1] Declare wd_propertysuggester streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) (owner: 10Ottomata) [19:39:12] (03PS3) 10Ottomata: Declare wd_propertysuggester streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) [19:41:10] RhinosF1: gerrit does see a newline...which I don't in vim... 
[19:41:17] which I don't see in vim* [19:42:12] (03CR) 10Ottomata: [C: 03+2] Declare wd_propertysuggester streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) (owner: 10Ottomata) [19:42:25] (03CR) 10Michaelcochez: "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709098 (https://phabricator.wikimedia.org/T287760) (owner: 10Ottomata) [19:44:03] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Declare wd_propertysuggester streams - T287760 (duration: 00m 57s) [19:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:12] T287760: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 [19:46:31] (03PS1) 10Andrew Bogott: cloud-init firstboot: define $PUPPETLOCK [puppet] - 10https://gerrit.wikimedia.org/r/709100 [19:48:55] PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - free space: /srv/docker 0 MB (0% inode=69%): /srv/docker/overlay2/05bce19cd90531b6e983d9516fe9cdd71d3d798bc1f1ea65fd27106a664bfb8e/merged 0 MB (0% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [20:04:21] (03PS2) 10Andrew Bogott: cloud-init firstboot: define $PUPPETLOCK [puppet] - 10https://gerrit.wikimedia.org/r/709100 [20:09:40] (03PS3) 10Andrew Bogott: cloud-init firstboot: define $PUPPETLOCK [puppet] - 10https://gerrit.wikimedia.org/r/709100 [20:10:27] (03CR) 10Andrew Bogott: [C: 03+2] cloud-init firstboot: define $PUPPETLOCK [puppet] - 10https://gerrit.wikimedia.org/r/709100 (owner: 10Andrew Bogott) [20:30:09] (03PS1) 10Ottomata: Restore wgEventStreamsDefaultSettings in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709103 (https://phabricator.wikimedia.org/T287760) [20:31:28] (03CR) 
10Ottomata: [C: 03+2] Restore wgEventStreamsDefaultSettings in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709103 (https://phabricator.wikimedia.org/T287760) (owner: 10Ottomata) [20:39:13] !log wiping kafka jumbo cluster in deployment-prep beta [20:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:09] (03PS1) 10Legoktm: Add shellbox-constraints namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/709104 (https://phabricator.wikimedia.org/T285104) [20:57:50] (03PS1) 10Legoktm: Add tokens for shellbox-constraints service [labs/private] - 10https://gerrit.wikimedia.org/r/709106 (https://phabricator.wikimedia.org/T285104) [20:58:32] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add tokens for shellbox-constraints service [labs/private] - 10https://gerrit.wikimedia.org/r/709106 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [21:03:40] (03CR) 10Bstorm: metricsinfra: add karma with cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [21:05:59] (03PS1) 10Ahmon Dancy: Disable $wmgUseTranslationNotifications in train-dev environment [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/709107 [21:06:15] (03CR) 10Ahmon Dancy: [C: 03+2] Disable $wmgUseTranslationNotifications in train-dev environment [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/709107 (owner: 10Ahmon Dancy) [21:06:58] (03Merged) 10jenkins-bot: Disable $wmgUseTranslationNotifications in train-dev environment [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/709107 (owner: 10Ahmon Dancy) [21:08:04] (03PS1) 10Legoktm: Add k8s shellbox and shellbox-constraints users [labs/private] - 10https://gerrit.wikimedia.org/r/709108 (https://phabricator.wikimedia.org/T281423) [21:08:30] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s shellbox and shellbox-constraints users [labs/private] - 
10https://gerrit.wikimedia.org/r/709108 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [21:12:26] (03CR) 10Legoktm: [C: 03+2] Add tokens and users for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/709097 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [21:13:10] (03CR) 10Legoktm: [C: 03+2] Add shellbox-constraints namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/709104 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [21:15:01] (03CR) 10Majavah: metricsinfra: add karma with cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [21:15:51] (03Merged) 10jenkins-bot: Add shellbox-constraints namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/709104 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [21:18:12] 10SRE, 10Performance-Team, 10serviceops: WARNING: opcache cache-hit ratio is below 99.99% on multiple eqiad appservers and parsoid servers - https://phabricator.wikimedia.org/T287792 (10Legoktm) [21:22:01] (03PS1) 10Andrew Bogott: Fix typo in smartmon reset-failed exec name [puppet] - 10https://gerrit.wikimedia.org/r/709110 [21:22:03] (03PS1) 10Andrew Bogott: cloud-init: remove mount: entry [puppet] - 10https://gerrit.wikimedia.org/r/709111 [21:23:37] (03CR) 10Andrew Bogott: [C: 03+2] cloud-init: remove mount: entry [puppet] - 10https://gerrit.wikimedia.org/r/709111 (owner: 10Andrew Bogott) [21:23:59] (03CR) 10Andrew Bogott: [C: 03+2] Fix typo in smartmon reset-failed exec name [puppet] - 10https://gerrit.wikimedia.org/r/709110 (owner: 10Andrew Bogott) [21:26:14] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001 alerting for 2043 mails in exim queue - https://phabricator.wikimedia.org/T287793 (10Legoktm) p:05Triage→03High [21:31:51] (03PS1) 10Reedy: Use correct load balancers for remote databases [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/708994 (https://phabricator.wikimedia.org/T287780) [21:33:25] RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [21:39:33] ^ for releases1002 disk: that is /srv/docker filling up and it got cleaned [21:46:13] !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [21:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:56] (03CR) 10Bstorm: metricsinfra: add karma with cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [21:47:27] !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [21:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:43] !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [21:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:10] !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [21:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:31] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [21:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:10] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [21:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:27] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [21:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:21] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:35] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) [22:01:47] (03PS1) 10Reedy: Pass an actual user instance to $page->newPageUpdater() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708995 (https://phabricator.wikimedia.org/T287782) [22:03:21] (03PS1) 10Legoktm: Add helmfile.d for shellbox-constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/709114 (https://phabricator.wikimedia.org/T285104) [22:05:05] (03PS2) 10Legoktm: Add helmfile.d for shellbox-constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/709114 (https://phabricator.wikimedia.org/T285104) [22:06:46] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Matthiasb) As German Wiinews editor (+admin) I consider reoving DPL as a ba solution.. I... [22:22:08] !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [22:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:16] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid's jvm daemons. 
- razzi@cumin1001 [22:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:57] (03CR) 10Legoktm: [C: 03+2] Add helmfile.d for shellbox-constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/709114 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [22:38:39] (03Merged) 10jenkins-bot: Add helmfile.d for shellbox-constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/709114 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [22:44:39] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [22:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:31] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) ` $ curl https://staging.svc.eqiad.wmnet:4010/healthz { "__": "Shellbox running", "pid...