[00:04:13] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:08:34] (03PS1) 10MusikAnimal: Disable code mirror by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703633 (https://phabricator.wikimedia.org/T286270) [02:27:11] PROBLEM - SSH on wdqs1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:27:59] RECOVERY - SSH on wdqs1006.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:44:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:45:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:46:07] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:46:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:48:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:02:21] (03PS1) 10Marostegui: dbproxy1013,dbproxy1015: Add 
db1124 as failover [puppet] - 10https://gerrit.wikimedia.org/r/703634 (https://phabricator.wikimedia.org/T286042) [05:04:40] (03CR) 10Marostegui: [C: 03+2] dbproxy1013,dbproxy1015: Add db1124 as failover [puppet] - 10https://gerrit.wikimedia.org/r/703634 (https://phabricator.wikimedia.org/T286042) (owner: 10Marostegui) [05:08:33] (03PS1) 10Marostegui: dbproxy1012,dbproxy1014: Add db1125 as m1 failover [puppet] - 10https://gerrit.wikimedia.org/r/703635 (https://phabricator.wikimedia.org/T286042) [05:09:16] (03CR) 10Marostegui: [C: 03+2] dbproxy1012,dbproxy1014: Add db1125 as m1 failover [puppet] - 10https://gerrit.wikimedia.org/r/703635 (https://phabricator.wikimedia.org/T286042) (owner: 10Marostegui) [05:12:03] (03PS1) 10Marostegui: db1124, db1125: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/703636 (https://phabricator.wikimedia.org/T286042) [05:13:09] (03CR) 10Marostegui: [C: 03+2] db1124, db1125: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/703636 (https://phabricator.wikimedia.org/T286042) (owner: 10Marostegui) [05:16:00] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [05:22:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2092', diff saved to https://phabricator.wikimedia.org/P16788 and previous config saved to /var/cache/conftool/dbconfig/20210708-052216-marostegui.json [05:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2092 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16789 and previous config saved to /var/cache/conftool/dbconfig/20210708-052302-root.json [05:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2092 (re)pooling @ 
50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16790 and previous config saved to /var/cache/conftool/dbconfig/20210708-053805-root.json [05:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2092 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16791 and previous config saved to /var/cache/conftool/dbconfig/20210708-055309-root.json [05:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2092 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16792 and previous config saved to /var/cache/conftool/dbconfig/20210708-060812-root.json [06:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:45] PROBLEM - SSH on wdqs1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:05] Deploy window No deploys all week! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210708T0700) [07:07:13] (03PS1) 10Muehlenhoff: Add component for Ganeti 2.16 backport for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/703699 (https://phabricator.wikimedia.org/T284811) [07:08:44] (03CR) 10Muehlenhoff: [C: 03+2] Add component for Ganeti 2.16 backport for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/703699 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [07:14:16] (03PS1) 10Giuseppe Lavagetto: mwdebug: rationalize values, add codfw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/703700 [07:17:38] https://about.gitlab.com/releases/2021/07/07/critical-security-release-gitlab-14-0-4-released/ [07:17:50] but it looks like we don't have the affected feature enabled [07:19:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: rationalize values, add codfw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/703700 (owner: 10Giuseppe Lavagetto) [07:21:20] (03Merged) 10jenkins-bot: mwdebug: rationalize values, add codfw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/703700 (owner: 10Giuseppe Lavagetto) [07:30:25] (03PS1) 10Giuseppe Lavagetto: mwdebug: use all puppet-generated values [deployment-charts] - 10https://gerrit.wikimedia.org/r/703701 [07:45:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: use all puppet-generated values [deployment-charts] - 10https://gerrit.wikimedia.org/r/703701 (owner: 10Giuseppe Lavagetto) [07:45:47] (03Merged) 10jenkins-bot: mwdebug: use all puppet-generated values [deployment-charts] - 10https://gerrit.wikimedia.org/r/703701 (owner: 10Giuseppe Lavagetto) [08:19:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2130', diff saved to https://phabricator.wikimedia.org/P16793 and previous config saved to /var/cache/conftool/dbconfig/20210708-081922-marostegui.json [08:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:46] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16794 and previous config saved to /var/cache/conftool/dbconfig/20210708-081945-root.json [08:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:43] (03PS1) 10Giuseppe Lavagetto: mediawiki::nutcracker::yaml_defs: fix data structure [puppet] - 10https://gerrit.wikimedia.org/r/703704 [08:32:38] RECOVERY - SSH on wdqs1006.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:32:39] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30135/console" [puppet] - 10https://gerrit.wikimedia.org/r/703704 (owner: 10Giuseppe Lavagetto) [08:32:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::nutcracker::yaml_defs: fix data structure [puppet] - 10https://gerrit.wikimedia.org/r/703704 (owner: 10Giuseppe Lavagetto) [08:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16795 and previous config saved to /var/cache/conftool/dbconfig/20210708-083449-root.json [08:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:39] !log imported ganeti 2.16.0 for stretch-security/component/ganeti216 T284811 [08:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:13] T284811: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 [08:49:03] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16796 and previous config saved to /var/cache/conftool/dbconfig/20210708-084952-root.json [08:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:33] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:52] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16797 and previous config saved to /var/cache/conftool/dbconfig/20210708-090456-root.json [09:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2116', diff saved to https://phabricator.wikimedia.org/P16798 and previous config saved to /var/cache/conftool/dbconfig/20210708-092411-marostegui.json [09:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2116 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16799 and previous config saved to /var/cache/conftool/dbconfig/20210708-092436-root.json [09:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:38] !log upload golang-github-cloudflare-cfssl_1.6.0-1_amd64 to bullseye [09:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
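The dbctl entries above show each database host being repooled in four steps (25%, 50%, 75%, 100%), roughly 15 minutes apart. The following is a minimal sketch of how those steps could be pulled out of this log for review. It assumes only the line format visible above. The names `REPOOL_RE`, `repool_steps` and `step_gaps` are hypothetical helpers and not part of dbctl or conftool, and the sample lines are trimmed copies of the db2130 entries.

```python
import re
from datetime import datetime

# Matches the repool entries as they appear in this log, for example:
# [08:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: ...'
REPOOL_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log \S+ dbctl commit \(dc=all\): "
    r"'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%"
)


def repool_steps(log_text):
    """Return {host: [(time, percent), ...]} in the order the steps appear."""
    steps = {}
    for match in REPOOL_RE.finditer(log_text):
        when = datetime.strptime(match.group("time"), "%H:%M:%S")
        steps.setdefault(match.group("host"), []).append((when, int(match.group("pct"))))
    return steps


def step_gaps(entries):
    """Time elapsed between consecutive repool steps of one host."""
    return [later[0] - earlier[0] for earlier, later in zip(entries, entries[1:])]


# Trimmed copies of the db2130 entries from the log above.
sample = """\
[08:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: Repool after index change'
[08:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: Repool after index change'
[08:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: Repool after index change'
[09:04:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: Repool after index change'
"""

steps = repool_steps(sample)
gaps = step_gaps(steps["db2130"])
```

Because the pattern is not anchored to line starts, the same function can be run over the whole log text, even where several messages share one physical line.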
[09:27:14] <_joe_> jbond: gotta love golang package names in debian :P [09:28:13] indeed super descriptive, at least this one has a reasonable version instead of 1.2.0+git20160825.89.7fb22c8-3+deb10u1 [09:30:16] <_joe_> jbond: I'm tempted to register github.org just to mess with it :P [09:30:34] :D lol [09:30:50] <_joe_> I'm sure it's registered already, but you get my point [09:31:22] * jbond nod nod [09:31:47] (03PS1) 10Majavah: metricsinfra: Add HAProxy for distributing http traffic [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) [09:33:11] the real idiocy of the go build system (which Debian essentially only wraps around) starts when packages get forked and move hosting sites... [09:35:05] (03CR) 10Majavah: "This has already been tested on metricsinfra-haproxy-1.metricsinfra.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [09:39:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2116 (re)pooling @ 50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16800 and previous config saved to /var/cache/conftool/dbconfig/20210708-093939-root.json [09:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:20] (03PS1) 10Jbond: ocsprefresh: Add patches for multi aki in ohcsprefresh [debs/cfssl] - 10https://gerrit.wikimedia.org/r/703709 [09:44:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] ocsprefresh: Add patches for multi aki in ohcsprefresh [debs/cfssl] - 10https://gerrit.wikimedia.org/r/703709 (owner: 10Jbond) [09:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2116 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16801 and previous config saved to /var/cache/conftool/dbconfig/20210708-095443-root.json [09:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:50] (03PS1) 10Muehlenhoff: 
Add mx1002/mx2002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/703710 [10:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2116 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16802 and previous config saved to /var/cache/conftool/dbconfig/20210708-100947-root.json [10:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:45] !log upgrade golang-cfssl [10:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:24] (03CR) 10Muehlenhoff: [C: 03+2] Add mx1002/mx2002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/703710 (owner: 10Muehlenhoff) [10:53:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2103', diff saved to https://phabricator.wikimedia.org/P16803 and previous config saved to /var/cache/conftool/dbconfig/20210708-105353-marostegui.json [10:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:50] (03CR) 10H.krishna123: "Frontend webapp -- WIP" [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123) [11:10:15] (03PS1) 10Jbond: hiera: add new pki cluster [puppet] - 10https://gerrit.wikimedia.org/r/703724 (https://phabricator.wikimedia.org/T286339) [11:11:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30136/console" [puppet] - 10https://gerrit.wikimedia.org/r/703724 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [11:11:55] (03CR) 10Jbond: [V: 03+1 C: 03+2] hiera: add new pki cluster [puppet] - 10https://gerrit.wikimedia.org/r/703724 (https://phabricator.wikimedia.org/T286339) (owner: 
10Jbond) [11:22:08] (03PS1) 10Jbond: P:prometheus::ops: Add scrap configuration for PKI servers [puppet] - 10https://gerrit.wikimedia.org/r/703727 (https://phabricator.wikimedia.org/T286339) [11:23:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30137/console" [puppet] - 10https://gerrit.wikimedia.org/r/703727 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [11:26:02] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [11:28:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30138/console" [puppet] - 10https://gerrit.wikimedia.org/r/703727 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [11:38:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:prometheus::ops: Add scrap configuration for PKI servers [puppet] - 10https://gerrit.wikimedia.org/r/703727 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [12:13:56] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10wdwb-tech, 10Patch-For-Review: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Addshore) [12:21:05] (03PS1) 10Jbond: R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 [12:22:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30139/console" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [12:28:25] (03PS1) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703733 [12:29:24] (03CR) 10jerkins-bot: [V: 04-1] P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703733 (owner: 10Jbond) [12:33:20] (03PS2) 10Jbond: 
R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 [12:36:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30141/console" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [12:38:18] (03PS3) 10Jbond: R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 [12:38:26] !log installing openexr security updates [12:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30142/console" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [12:46:07] (03PS4) 10Jbond: R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 [12:49:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30143/console" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [12:50:18] (03PS2) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703733 [12:50:51] (03CR) 10jerkins-bot: [V: 04-1] P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703733 (owner: 10Jbond) [12:51:41] (03PS3) 10Muehlenhoff: conf: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) [12:52:23] (03CR) 10Jbond: [V: 03+1] "Pretty confident the diffs in pcc relate to missing data in pcc puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [12:52:56] !log installing klibc security updates on buster [12:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:20] (03PS3) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 
10https://gerrit.wikimedia.org/r/703733 [12:59:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16804 and previous config saved to /var/cache/conftool/dbconfig/20210708-125910-root.json [12:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:08] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow exposing http and https at the same time [deployment-charts] - 10https://gerrit.wikimedia.org/r/703739 [13:01:09] !log otto@deploy1002 Started deploy [analytics/refinery@2d4c645]: Make gobblin-netflow use production directory - T271232 [13:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:15] (03PS1) 10Jbond: P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 [13:01:16] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [13:01:36] (03CR) 10Ottomata: Gobblinize refine_netflow job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:01:58] (03Abandoned) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703733 (owner: 10Jbond) [13:04:31] !log otto@deploy1002 Finished deploy [analytics/refinery@2d4c645]: Make gobblin-netflow use production directory - T271232 (duration: 03m 22s) [13:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:48] (03PS2) 10Ottomata: Gobblinize netflow Hadoop ingestion [puppet] - 10https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) [13:06:16] (03CR) 10Ottomata: [C: 03+2] Gobblinize netflow Hadoop ingestion [puppet] - 10https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:07:28] (03PS1) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 
10https://gerrit.wikimedia.org/r/703741 [13:11:54] (03PS5) 10Jbond: R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 [13:14:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16805 and previous config saved to /var/cache/conftool/dbconfig/20210708-131414-root.json [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30146/console" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [13:17:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow exposing http and https at the same time [deployment-charts] - 10https://gerrit.wikimedia.org/r/703739 (owner: 10Giuseppe Lavagetto) [13:19:39] (03Merged) 10jenkins-bot: mediawiki: allow exposing http and https at the same time [deployment-charts] - 10https://gerrit.wikimedia.org/r/703739 (owner: 10Giuseppe Lavagetto) [13:22:35] (03PS2) 10Jbond: P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 [13:23:47] (03PS3) 10Jbond: P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 [13:26:43] (03PS1) 10Ottomata: Use hive/gobblin date path format for netflow data purge job [puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) [13:27:13] (03CR) 10jerkins-bot: [V: 04-1] Use hive/gobblin date path format for netflow data purge job [puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:28:08] (03CR) 10Joal: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:28:14] (03PS2) 10Ottomata: Use hive/gobblin date path format for netflow data purge job [puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) [13:28:29] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30148/console" [puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:29:09] I'm seeing DNS issues with my ISP in Brazil [13:29:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16806 and previous config saved to /var/cache/conftool/dbconfig/20210708-132917-root.json [13:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:31] (03PS3) 10Ottomata: Use hive/gobblin date path format for netflow data purge job [puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) [13:29:40] (03PS4) 10Jbond: P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 [13:29:41] https://www.irccloud.com/pastebin/kSIGvfDq/ [13:30:27] (03CR) 10Ottomata: [C: 03+2] Use hive/gobblin date path format for netflow data purge job [puppet] - 10https://gerrit.wikimedia.org/r/703742 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:32:42] That ISP is Vivo (telefônica), one of the largest in Brazil, affecting both mobile and broadband residential connections. 
[13:32:52] (for me, at least) [13:34:11] (I do have a second ISP and their DNS resolvers are fine) [13:34:21] (03PS5) 10Jbond: P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 [13:36:06] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:36:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30150/console" [puppet] - 10https://gerrit.wikimedia.org/r/703740 (owner: 10Jbond) [13:37:06] (03PS6) 10Jbond: P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 [13:37:16] (03PS2) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703741 [13:39:55] (03CR) 10Jbond: "Gone down a bit of a rabbit hole; feel free to just reject this outright if it's not useful" [puppet] - 10https://gerrit.wikimedia.org/r/703740 (owner: 10Jbond) [13:40:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30151/console" [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [13:40:13] (03CR) 10Jbond: "Gone down a bit of a rabbit hole; feel free to just reject this outright if it's not useful" [puppet] - 10https://gerrit.wikimedia.org/r/703741 (owner: 10Jbond) [13:41:40] lots of chatter on Twitter about Wikipedia being offline due to it. herron: not sure if there is anything to be done here, but maybe be aware Brazilians may complain due to this. 
[13:41:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16807 and previous config saved to /var/cache/conftool/dbconfig/20210708-134421-root.json [13:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:40] <_joe_> chicocvenancio: if you run dig against our dns servers you get a good response? [13:47:02] <_joe_> chicocvenancio: not many people from wmf around sorry, I was in the middle of doing a benchmark [13:47:22] against 8.8.8.8 and against my other ISP I get a good response. [13:47:42] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:47:55] <_joe_> chicocvenancio: ok just to try to be sure, what is the response, from the bad isp's DNS, for dig -t NS wikipedia.org ? [13:48:13] <_joe_> or, do you have the IP of that dns server? [13:48:24] <_joe_> anyways, I'd say they do have an issue, not us [13:49:12] <_joe_> I would also check if it's just wikipedia.org or also other domains like wikisource etc [13:49:16] https://www.irccloud.com/pastebin/BVpmFix5/ [13:49:32] <_joe_> ok something's very, very broken [13:49:43] <_joe_> on their side, I'd say [13:50:04] <_joe_> not much we can do, as far as I can tell [13:50:04] dns's 200.175.5.139 200.175.89.139 [13:50:32] <_joe_> ofc they don't allow me to run queries [13:51:40] _joe_: yeah, I figured there is nothing to do here. [13:52:49] <_joe_> chicocvenancio: last test; from the network of that isp, what's the response to "dig en.wikipedia.org @208.80.154.238" ? 
[13:53:58] lgtm [13:54:00] https://www.irccloud.com/pastebin/DrSR9ig5/ [13:54:15] <_joe_> so it's not even a network problem [13:54:22] <_joe_> as in internet-wide routing [13:54:26] nope, only dns [13:54:32] <_joe_> yeah, sorry, not much else I can do :/ [14:04:58] (03PS1) 10Ottomata: Run gobblin netflow at 5 and 35 after the hour [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) [14:05:12] (03PS2) 10Ottomata: Run gobblin netflow at 5 and 35 after the hour [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) [14:05:41] (03CR) 10Ottomata: "My systemd timer syntax could be wrong, but I think this is correct." [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) (owner: 10Ottomata) [14:06:23] (03CR) 10Muehlenhoff: "While this would work, I would prefer a systemd native solution, update-rc.d might just as well go away at some point. (What's in policy d" [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [14:07:10] (03CR) 10Joal: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) (owner: 10Ottomata) [14:08:04] (03PS3) 10Ottomata: Run gobblin netflow at 5 and 35 after the hour [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) [14:10:43] (03PS4) 10Ottomata: Run gobblin netflow at 5 and 35 after the hour [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) [14:14:03] (03CR) 10Ottomata: [C: 03+2] Run gobblin netflow at 5 and 35 after the hour [puppet] - 10https://gerrit.wikimedia.org/r/703750 (https://phabricator.wikimedia.org/T286343) (owner: 10Ottomata) [14:22:42] (03PS6) 10Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) [14:25:08] 
SRE, MW-on-K8s, serviceops, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe) I've deployed the mwdebug deployment to all clusters; for now it's running pinned to kubernetes1017/kubernetes2017 via a `nodeSelector` constraint, so...
[14:31:29] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) In order to run some tests on the mwdebug deployment in codfw, I first exposed http on port 8444 and then ran: ` ab -n 1000 -c 2 -H 'X-Forwarded-Proto: https' -H 'Host: en.wikiped...
[14:42:07] (PS7) Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901)
[14:47:11] (CR) Joal: [C: +1] "Thank you for the renaming - LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) (owner: Ottomata)
[14:47:39] (CR) Ottomata: [C: +2] Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) (owner: Ottomata)
[14:52:29] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Add consumers.analytics_hadoop-ingestion stream config settings for automated gobblin imports - T271232 T273901 (duration: 01m 09s)
[14:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:39] T273901: Automate event stream ingestion into HDFS for streams that don't use EventGate - https://phabricator.wikimedia.org/T273901
[14:52:39] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[15:02:54] (CR) Muehlenhoff: "*facepalm* So turns out I created the file as apache2.present instead of apache2.preset..." [puppet] - https://gerrit.wikimedia.org/r/701538 (owner: Jbond)
[15:05:45] !log otto@deploy1002 Started deploy [analytics/refinery@42541e6] (hadoop-test): Deploy for eventlogging_legacy gobblin migration - T271232
[15:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:52] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[15:11:27] !log otto@deploy1002 Finished deploy [analytics/refinery@42541e6] (hadoop-test): Deploy for eventlogging_legacy gobblin migration - T271232 (duration: 05m 42s)
[15:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:33] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[15:13:23] (PS1) Ottomata: Add gobblin job in hadoop test eventlogging_legacy_test [puppet] - https://gerrit.wikimedia.org/r/703755 (https://phabricator.wikimedia.org/T271232)
[15:13:38] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[15:14:30] (CR) Joal: [C: +1] "Yay!" [puppet] - https://gerrit.wikimedia.org/r/703755 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:15:03] (CR) Ottomata: [C: +2] Add gobblin job in hadoop test eventlogging_legacy_test [puppet] - https://gerrit.wikimedia.org/r/703755 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:15:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:18:31] (CR) Muehlenhoff: "My initial tests were done on bullseye, where it works fine. Unfortunately however, on Buster and Stretch the service start it _not_ preve" [puppet] - https://gerrit.wikimedia.org/r/701538 (owner: Jbond)
[15:23:36] !log otto@deploy1002 Started deploy [analytics/refinery@51f4696] (hadoop-test): Deploy for eventlogging_legacy gobblin with final import path - T271232
[15:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:43] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[15:26:48] (PS1) Giuseppe Lavagetto: mediawiki: fix scrape uri for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703758
[15:29:04] !log otto@deploy1002 Finished deploy [analytics/refinery@51f4696] (hadoop-test): Deploy for eventlogging_legacy gobblin with final import path - T271232 (duration: 05m 27s)
[15:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:15] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[15:30:54] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[15:34:40] !log otto@deploy1002 Started deploy [analytics/refinery@9883dbf] (hadoop-test): Deploy for event_default_test job in hadoop test - T271232
[15:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:47] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[15:35:03] (PS1) Ottomata: Add gobblin job in hadoop test event_default_test [puppet] - https://gerrit.wikimedia.org/r/703761 (https://phabricator.wikimedia.org/T271232)
[15:36:30] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:37:19] Update, the dns issue in Brazil's Vivo is with all .org domains.
[15:37:46] !log otto@deploy1002 Finished deploy [analytics/refinery@9883dbf] (hadoop-test): Deploy for event_default_test job in hadoop test - T271232 (duration: 03m 06s)
[15:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:53] chicocvenancio: is there a way to get status updates?
[15:43:11] (CR) Joal: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/703761 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:43:27] (CR) Ottomata: [C: +2] Add gobblin job in hadoop test event_default_test [puppet] - https://gerrit.wikimedia.org/r/703761 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:59:22] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[16:01:45] (PS1) Ottomata: Gobblinize eventlogging_legacy_test Refine and data purge jobs [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232)
[16:02:14] (CR) jerkins-bot: [V: -1] Gobblinize eventlogging_legacy_test Refine and data purge jobs [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[16:03:15] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:03:55] (CR) Joal: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[16:18:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:04] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix scrape uri for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703758 (owner: Giuseppe Lavagetto)
[16:36:57] (Merged) jenkins-bot: mediawiki: fix scrape uri for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703758 (owner: Giuseppe Lavagetto)
[16:39:58] (PS1) Legoktm: shellbox: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703778
[16:40:00] (PS1) Legoktm: scaffold: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703779
[16:44:06] (CR) Giuseppe Lavagetto: [C: +1] shellbox: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703778 (owner: Legoktm)
[16:44:10] !log joal@deploy1002 Started deploy [analytics/refinery@51a73f1]: Analytics deploy for Gobblin replacing Camus - an-launcher1002 only [analytics/refinery@51a73f1]
[16:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:18] (CR) Giuseppe Lavagetto: [C: +1] scaffold: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703779 (owner: Legoktm)
[16:45:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:14] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:47:27] !log joal@deploy1002 Finished deploy [analytics/refinery@51a73f1]: Analytics deploy for Gobblin replacing Camus - an-launcher1002 only [analytics/refinery@51a73f1] (duration: 03m 17s)
[16:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:55:32] so looks like prod mx*.wikimedia.org handle mail for wmflabs.org. can I ask where root@wmflabs.org routes?
[16:56:54] !log joal@deploy1002 Started deploy [analytics/refinery@51a73f1] (hadoop-test): Analytics deploy for Gobblin replacing Camus - hadoop-test [analytics/refinery@51a73f1]
[16:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:00] I don't think it ends up at root@wm.o
[17:02:32] !log joal@deploy1002 Finished deploy [analytics/refinery@51a73f1] (hadoop-test): Analytics deploy for Gobblin replacing Camus - hadoop-test [analytics/refinery@51a73f1] (duration: 05m 38s)
[17:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:49] majavah: I'll pm you
[17:28:28] (PS2) Ottomata: Gobblinize eventlogging_legacy_test Refine and data purge jobs [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232)
[17:28:55] (CR) jerkins-bot: [V: -1] Gobblinize eventlogging_legacy_test Refine and data purge jobs [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[17:31:12] (PS3) Ottomata: Gobblinize eventlogging_legacy_test Refine and data purge jobs [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232)
[17:32:00] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30154/console" [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[17:33:18] (CR) Ottomata: [V: +1 C: +2] Gobblinize eventlogging_legacy_test Refine and data purge jobs [puppet] - https://gerrit.wikimedia.org/r/703766 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[17:37:53] (PS1) Ottomata: Fix checksum for refinery-drop-eventlogging-legacy-raw-partitions [puppet] - https://gerrit.wikimedia.org/r/703782 (https://phabricator.wikimedia.org/T271232)
[17:38:09] (CR) Ottomata: [V: +2 C: +2] Fix checksum for refinery-drop-eventlogging-legacy-raw-partitions [puppet] - https://gerrit.wikimedia.org/r/703782 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[17:44:05] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[17:45:56] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:06:00] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[18:07:54] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[18:09:03] (PS1) Ottomata: Fix for refine eventlogging_legacy_test [puppet] - https://gerrit.wikimedia.org/r/703783 (https://phabricator.wikimedia.org/T271232)
[18:10:46] (CR) Ottomata: [C: +2] Fix for refine eventlogging_legacy_test [puppet] - https://gerrit.wikimedia.org/r/703783 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[18:24:23] (CR) Legoktm: Re-enable Score using Shellbox on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[18:28:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[18:30:54] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:31:18] (PS2) Krinkle: [Beta Cluster] Disable wgEnableWANCacheReaper experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/677731
[18:33:33] (PS1) Ottomata: Gobblinize test refine_event and drop_event jobs [puppet] - https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232)
[18:34:00] (CR) jerkins-bot: [V: -1] Gobblinize test refine_event and drop_event jobs [puppet] - https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[18:45:44] (PS1) RLazarus: tcpircbot: Reduce max length of an IRC message. [puppet] - https://gerrit.wikimedia.org/r/703787 (https://phabricator.wikimedia.org/T285709)
[18:48:41] (CR) Legoktm: [C: +1] tcpircbot: Reduce max length of an IRC message. [puppet] - https://gerrit.wikimedia.org/r/703787 (https://phabricator.wikimedia.org/T285709) (owner: RLazarus)
[18:48:59] (CR) Legoktm: [C: +2] shellbox: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703778 (owner: Legoktm)
[18:51:02] (PS1) Legoktm: Allow setting a different path for `convert` just for Score [extensions/Score] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703652
[18:51:26] (Merged) jenkins-bot: shellbox: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703778 (owner: Legoktm)
[18:53:22] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[18:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:06] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[18:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:40] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[18:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:36] (CR) Legoktm: [C: +2] scaffold: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703779 (owner: Legoktm)
[19:04:13] (Merged) jenkins-bot: scaffold: Fix scrape URI for apache exporter [deployment-charts] - https://gerrit.wikimedia.org/r/703779 (owner: Legoktm)
[19:04:37] (CR) Legoktm: [C: +2] Allow setting a different path for `convert` just for Score [extensions/Score] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703652 (owner: Legoktm)
[19:05:58] legoktm: does that mean Score will soon work again? 🙂
[19:06:25] my goal is to enable it on testwiki next week
[19:06:34] sounds wonderful!
[19:06:37] thanks :)
[19:07:17] :D
[19:24:51] (Merged) jenkins-bot: Allow setting a different path for `convert` just for Score [extensions/Score] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703652 (owner: Legoktm)
[19:27:34] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/Score/extension.json: Allow setting a different path for `convert` just for Score (1/2) (duration: 00m 58s)
[19:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:12] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/Score/includes/Score.php: Allow setting a different path for `convert` just for Score (2/2) (duration: 00m 57s)
[19:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:22] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:31:57] legoktm: will advice for 3rd party installs follow?
[19:32:12] what advice are you looking for?
[19:36:44] (PS7) Legoktm: Add configuration to use Score with Shellbox (still disabled) [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423)
[19:36:46] (PS1) Legoktm: Enable Score using Shellbox on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703790 (https://phabricator.wikimedia.org/T257066)
[19:36:59] (PS8) Legoktm: Add configuration to use Score with Shellbox (still disabled) [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423)
[19:37:01] (PS2) Legoktm: Enable Score using Shellbox on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703790 (https://phabricator.wikimedia.org/T257066)
[19:37:44] legoktm: how we do it safely? Setting up shellbox etc
[19:37:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[19:38:57] have you read https://gerrit.wikimedia.org/g/mediawiki/extensions/Score ?
[19:39:27] but basically yes, set up Shellbox to contain lilypond, etc.
[19:39:43] Shellbox does need much better docs though
[19:41:07] Ty
[19:41:45] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:49:26] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[19:51:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:23:20] (CR) RLazarus: [C: +2] tcpircbot: Reduce max length of an IRC message. [puppet] - https://gerrit.wikimedia.org/r/703787 (https://phabricator.wikimedia.org/T285709) (owner: RLazarus)
[20:31:10] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:08:05] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[21:11:55] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[22:03:20] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404): /api (Zotero and citoid alive) is WARNING: Test Zotero and citoid alive responds with unexpected value at path [0]/itemType = webpage https://wikitech.wikimedia.org/wiki/Citoid
[22:05:18] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:23:45] ^ I'm not sure what to do with those intermittent citoid issues, digging a bit
[22:24:05] from the citoid dashboard it looks like a zotero issue, and of course the zotero dashboard famously doesn't say much of anything
[22:24:47] not sure if it's a change in the backend or the traffic though
[22:26:46] the test request at https://wikitech.wikimedia.org/wiki/Zotero#Verify fails a decent percentage of the time, so I'm guessing it's the backend
[22:42:19] (CR) Legoktm: [C: +2] Add configuration to use Score with Shellbox (still disabled) [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[22:43:00] (Merged) jenkins-bot: Add configuration to use Score with Shellbox (still disabled) [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[22:44:02] rzl: I peeked at zotero late last week when it paged and didn't come away with anything useful
[22:44:18] the logs were full with just CSS files
[22:46:38] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Add configuration to use Score with Shellbox (still disabled) (1/2) - T281423 (duration: 00m 58s)
[22:46:44] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404): /api (Zotero and citoid alive) is WARNING: Test Zotero and citoid alive responds with unexpected value at path [0]/itemType = webpage https://wikitech.wikimedia.org/wiki/Citoid
[22:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:46] T281423: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423
[22:48:30] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:48:38] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Add configuration to use Score with Shellbox (still disabled) (2/2) - T281423 (duration: 00m 57s)
[22:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:06] hm
[22:51:08] (node:1) UnhandledPromiseRejectionWarning: Error: getaddrinfo ENOTFOUND en.wikipedia.org en.wikipedia.org:443
[22:51:43] (node:1) UnhandledPromiseRejectionWarning: Error: tunneling socket could not be established, cause=getaddrinfo ENOTFOUND url-downloader.codfw.wikimedia.org url-downloader.codfw.wikimedia.org:8080
[22:52:04] legoktm: huh okay, re zotero
[22:53:36] it's probably too much to ask for timestamps in the zotero logs
[22:53:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[22:57:28] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[22:58:21] the DNS failures only started in the past day
[22:58:34] https://logstash.wikimedia.org/goto/8506229d577ef9184182d7b1b6a4b5fa
[22:58:42] of course, no idea if that's the actual issue or a red herring
[23:01:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:01:38] legoktm: I'll buy it, the timing lines up with the 3xx/5xxs on https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-2d&to=now&refresh=5m
[23:01:56] sorry I mean, in the "citation request to zotero" graph
[23:03:53] I don't have any suggestions on how to fix besides just restarting zotero
[23:04:58] the errors are mostly evenly split across hosts too
[23:52:40] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[23:56:28] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton