[00:56:52] (PS2) AntiCompositeNumber: VCL: Maps Referer block: allow wikimedia.it [puppet] - https://gerrit.wikimedia.org/r/703929 (https://phabricator.wikimedia.org/T261694)
[02:20:11] Puppet, Infrastructure-Foundations, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (lmata)
[02:20:18] SRE, Graphite, SRE Observability (FY2021/2022-Q1): Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 (lmata)
[02:21:27] SRE, Infrastructure-Foundations, CAS-SSO, SRE Observability (FY2021/2022-Q1), User-jbond: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (lmata)
[02:21:35] SRE, SRE Observability (FY2021/2022-Q1): Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 (lmata)
[02:21:39] SRE, Wikimedia-Logstash, SRE Observability (FY2021/2022-Q1), Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (lmata)
[02:21:47] SRE, Infrastructure-Foundations, netops, SRE Observability (FY2021/2022-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (lmata)
[02:21:59] SRE, Analytics-Clusters, Analytics-Radar, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (lmata)
[02:22:03] SRE, SRE Observability (FY2021/2022-Q1), Security, User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (lmata)
[02:22:11] SRE, DC-Ops, Infrastructure-Foundations, SRE Observability, netops: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (lmata)
[02:22:27] SRE, Citoid, SRE
Observability, Wikimedia-Logstash, and 2 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (lmata)
[02:22:31] SRE, Privacy Engineering, SRE Observability, Wikimedia-Logstash, Privacy: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (lmata)
[02:22:49] SRE, SRE Observability, Wikimedia-Logstash: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (lmata)
[02:22:57] SRE, Icinga, SRE Observability, Scap: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777 (lmata)
[02:23:05] SRE, SRE Observability, Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (lmata)
[02:23:09] SRE, SRE Observability, Graphite: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (lmata)
[02:23:13] SRE, SRE Observability, Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172 (lmata)
[02:23:17] SRE, SRE Observability, Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (lmata)
[02:23:35] SRE, Infrastructure-Foundations, Mail, SRE Observability, Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (lmata)
[02:23:54] SRE, Icinga, Infrastructure-Foundations, Mail, SRE Observability: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890 (lmata)
[02:24:06] SRE, SRE Observability, Wikimedia-Logstash: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (lmata)
[02:24:16] SRE, SRE Observability, Documentation,
Service-Architecture, Services (later): Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780 (lmata)
[02:24:26] SRE, SRE Observability, Wikimedia-Logstash, Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (lmata)
[02:24:30] SRE, SRE Observability, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (lmata)
[02:24:36] SRE, SRE Observability, Documentation, Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (lmata)
[02:24:44] SRE, SRE Observability, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (lmata)
[02:24:54] SRE, Icinga, SRE Observability, serviceops: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (lmata)
[02:25:00] SRE, MW-on-K8s, SRE Observability, serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (lmata)
[02:25:24] SRE, SRE Observability, Traffic, Patch-For-Review: Implement SLI measurement for Varnish Frontend - https://phabricator.wikimedia.org/T284576 (lmata)
[02:25:48] SRE, SRE Observability, Wikimedia-Mailing-lists, Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (lmata)
[02:25:52] SRE, Icinga, SRE Observability: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (lmata)
[02:26:04] SRE, SRE Observability: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (lmata)
[02:26:22] SRE, SRE Observability: thanos: 404 error trying to fetch js library - https://phabricator.wikimedia.org/T269000 (lmata)
[02:26:36] SRE,
Infrastructure-Foundations, SRE Observability, CAS-SSO, and 3 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (lmata)
[02:26:48] SRE, SRE Observability: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135 (lmata)
[02:26:52] SRE, SRE Observability: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423 (lmata)
[02:27:00] SRE, SRE Observability: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (lmata)
[02:27:10] Puppet, SRE, Infrastructure-Foundations, SRE Observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (lmata)
[02:27:14] SRE, Discovery-Search, SRE Observability, Patch-For-Review: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (lmata)
[02:27:34] SRE, SRE Observability, Upstream: Grafana error: "parse error at char 1: unexpected character: '\\ufeff'" when copy-pasting metric names - https://phabricator.wikimedia.org/T263624 (lmata)
[02:27:40] SRE, SRE Observability, serviceops, Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (lmata)
[02:27:44] SRE, SRE Observability, Wikimedia-Logstash: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (lmata)
[02:27:57] SRE, SRE Observability: Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (lmata)
[02:28:00] SRE, SRE Observability, User-fgiunchedi: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 (lmata)
[02:28:11] SRE, Elasticsearch, SRE Observability, Wikimedia-Logstash, Services
(watching): logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (lmata)
[02:28:15] SRE, SRE Observability: Puppet fail to properly refresh Icinga - https://phabricator.wikimedia.org/T184714 (lmata)
[02:28:19] SRE, SRE Observability: Aggregate prometheus functions yielding different results in grafana vs. prometheus console - https://phabricator.wikimedia.org/T168403 (lmata)
[02:28:23] SRE, SRE Observability, User-fgiunchedi: Review prometheus_nodes params - https://phabricator.wikimedia.org/T207292 (lmata)
[02:28:29] SRE, SRE Observability, Wikimedia-Logstash, Patch-For-Review, Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (lmata)
[02:28:47] SRE, Cloud-VPS, SRE Observability: UNIX group 'bird' missing on bird package installation - https://phabricator.wikimedia.org/T260240 (lmata)
[02:28:53] SRE, SRE Observability, serviceops, Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (lmata)
[02:28:59] SRE, SRE Observability, Traffic, Patch-For-Review: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (lmata)
[02:29:03] SRE, SRE Observability: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422 (lmata)
[02:29:09] SRE, SRE Observability, Traffic, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (lmata)
[02:29:19] SRE, SRE Observability, Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes.
- https://phabricator.wikimedia.org/T114849 (lmata)
[02:29:31] SRE, SRE Observability, Wikimedia-Logstash: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296 (lmata)
[02:29:35] SRE, SRE Observability: automation: issue reminders for about-to-expire downtimes - https://phabricator.wikimedia.org/T230633 (lmata)
[02:29:39] SRE, Icinga, SRE Observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (lmata)
[02:29:59] SRE, SRE Observability: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (lmata)
[02:30:03] SRE, SRE Observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (lmata)
[02:30:09] SRE, Icinga, Infrastructure-Foundations, SRE Observability, SRE-tools: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (lmata)
[02:30:17] SRE, Icinga, SRE Observability: Aggregate Proton, Restbase and mobileapps icinga alerts - https://phabricator.wikimedia.org/T250017 (lmata)
[02:30:21] SRE, Icinga, SRE Observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (lmata)
[02:30:33] SRE, SRE Observability, User-fgiunchedi: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434 (lmata)
[02:30:37] SRE, SRE Observability, User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (lmata)
[02:30:43] Puppet, SRE, Icinga, Infrastructure-Foundations, SRE Observability: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (lmata)
[02:30:47] SRE, Icinga, SRE Observability: icinga
really needs to check puppet run success of passive icinga hosts - https://phabricator.wikimedia.org/T215848 (lmata)
[02:30:51] SRE, SRE Observability: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (lmata)
[02:30:55] SRE, SRE Observability, Sustainability (Incident Followup): prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) - https://phabricator.wikimedia.org/T222102 (lmata)
[02:30:59] SRE, SRE Observability, Sustainability (Incident Followup): prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (lmata)
[02:31:05] SRE, SRE Observability, Wikimedia-Logstash, Patch-For-Review, User-herron: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (lmata)
[02:31:09] SRE, SRE Observability, Wikimedia-Logstash, User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (lmata)
[02:31:13] SRE, SRE Observability, Wikimedia-Logstash: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (lmata)
[02:31:17] SRE, SRE Observability, Wikimedia-Logstash, Patch-For-Review: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (lmata)
[02:31:21] SRE, SRE Observability: check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (lmata)
[02:31:25] SRE, SRE Observability, Wikimedia-Logstash, Patch-For-Review, User-herron: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (lmata)
[02:31:41] SRE, SRE Observability: Add slabinfo prometheus exporter - https://phabricator.wikimedia.org/T160071 (lmata)
[02:31:45] SRE, SRE Observability: Better abstractions for puppet & icinga/nagios/shinken - https://phabricator.wikimedia.org/T85624 (lmata)
[02:31:49] SRE, SRE Observability: Monitor the BMC's event
log for hardware errors - https://phabricator.wikimedia.org/T136311 (lmata)
[02:31:58] SRE, Discovery-Search, Elasticsearch, SRE Observability, Wikimedia-Logstash: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (lmata)
[02:32:04] Puppet, SRE, Infrastructure-Foundations, SRE Observability: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (lmata)
[02:32:08] SRE, SRE Observability, User-CDanis, User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (lmata)
[02:32:24] SRE, SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (lmata)
[02:32:34] SRE, SRE Observability, serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (lmata)
[02:32:38] SRE, SRE Observability: librenms: consider using Distributed Poller with multiple netmon servers - https://phabricator.wikimedia.org/T171122 (lmata)
[02:32:44] SRE, MediaWiki-Debug-Logger, SRE Observability, Wikimedia-Logstash, Patch-For-Review: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (lmata)
[02:32:48] SRE, SRE Observability: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (lmata)
[02:32:52] SRE, Icinga, SRE Observability, Patch-For-Review, User-herron: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (lmata)
[02:32:56] SRE, Icinga, SRE Observability, User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (lmata)
[02:33:00] SRE, SRE Observability, Patch-For-Review: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395 (lmata)
[02:33:08] SRE, SRE
Observability: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (lmata)
[02:33:12] SRE, SRE Observability, Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (lmata)
[02:33:22] SRE, SRE Observability, Patch-For-Review, Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (lmata)
[02:33:28] SRE, SRE Observability: Update prometheus-node-exporter NTP metrics - https://phabricator.wikimedia.org/T208875 (lmata)
[02:33:40] SRE, Infrastructure-Foundations, SRE Observability, netops: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (lmata)
[02:33:48] SRE, Infrastructure-Foundations, SRE Observability, netops: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 (lmata)
[02:33:52] SRE, SRE Observability: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643 (lmata)
[02:39:18] SRE, SRE Observability, WMF-Legal, Graphite, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (lmata)
[02:39:20] SRE, SRE Observability, Goal: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (lmata)
[02:39:24] SRE, Infrastructure-Foundations, Mail, SRE Observability, User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (lmata)
[02:39:32] SRE, SRE Observability, Wikimedia-Logstash, Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (lmata)
[02:39:38] SRE, Analytics-Radar, SRE Observability, Wikimedia-Logstash,
Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[02:39:48] SRE, Infrastructure-Foundations, SRE Observability, netops: Provision plaintext syslog collectors in esams/ulsfo/eqsin - https://phabricator.wikimedia.org/T243065 (lmata)
[02:39:52] SRE, SRE Observability, Epic: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 (lmata)
[02:39:56] SRE, SRE Observability, Graphite, Patch-For-Review, User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (lmata)
[02:40:37] SRE, SRE Observability, User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (lmata)
[02:40:46] SRE, SRE Observability, SRE-OnFire: Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (lmata)
[02:40:53] SRE, SRE Observability, Datacenter-Switchover: Switchover thanos-query and thanos-swift services as part of DC switchover - https://phabricator.wikimedia.org/T285273 (lmata)
[02:41:14] SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (lmata)
[02:41:44] SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (lmata)
[02:41:51] SRE, SRE-OnFire, SRE Observability (FY2021/2022-Q1): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (lmata)
[02:41:53] SRE, Patch-For-Review, Performance-Team (Radar), SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870
(lmata)
[02:41:57] SRE, Wikimedia-Logstash, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Standardize the logging format - https://phabricator.wikimedia.org/T234565 (lmata)
[03:30:52] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:54:29] (PS1) Ladsgroup: Set testcommonswiki to use json image metadata [mediawiki-config] - https://gerrit.wikimedia.org/r/703950 (https://phabricator.wikimedia.org/T275268)
[04:00:03] (CR) Ladsgroup: [C: +2] "this is not very performant but it's temporary" [mediawiki-config] - https://gerrit.wikimedia.org/r/703950 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[04:00:45] (Merged) jenkins-bot: Set testcommonswiki to use json image metadata [mediawiki-config] - https://gerrit.wikimedia.org/r/703950 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[04:04:58] (PS1) Ladsgroup: Remove subscribing to other aspect for entity usage [extensions/Wikibase] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703890 (https://phabricator.wikimedia.org/T286193)
[04:08:38] !log ladsgroup@deploy1002 Synchronized wmf-config/filebackend.php: Config: [[gerrit:703950|Set testcommonswiki to use json image metadata (T275268)]] (duration: 01m 10s)
[04:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:46] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268
[04:10:38] !log mwscript refreshImageMetadata.php --wiki=testcommonswiki --mediatype=OFFICE --batch-size=20 --verbose --mime="application/pdf" --force (T275268)
[04:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:01] (CR) Ladsgroup: "This change is ready for review."
[core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703891 (owner: Ladsgroup)
[04:22:23] (PS1) Ladsgroup: Enable json image metadata everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/703951 (https://phabricator.wikimedia.org/T275268)
[04:24:24] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[04:31:19] (PS1) Ladsgroup: objectcache: Normalize exptime to ttl in APCu and WinCache [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703892 (https://phabricator.wikimedia.org/T286260)
[04:34:23] (CR) Ladsgroup: [C: +2] Add --sleep option to refreshImageMetadata.php [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703891 (owner: Ladsgroup)
[04:35:38] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[04:36:31] This doesn't seem to be related to my patch ^
[04:52:41] (Merged) jenkins-bot: Add --sleep option to refreshImageMetadata.php [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703891 (owner: Ladsgroup)
[04:56:56] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.12/maintenance/refreshImageMetadata.php: Backport: [[gerrit:703891|Add --sleep option to refreshImageMetadata.php]] (duration: 01m 04s)
[04:57:02] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:38] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[05:00:20] (CR) Ladsgroup: [C: +2] "let the party begin." [mediawiki-config] - https://gerrit.wikimedia.org/r/703951 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[05:01:03] (Merged) jenkins-bot: Enable json image metadata everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/703951 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[05:05:31] everything looks fine on mwdebug2002, moving forward
[05:06:11] !log ladsgroup@deploy1002 Synchronized wmf-config/filebackend.php: Config: [[gerrit:703951|Enable json image metadata everywhere (T275268)]] (duration: 01m 05s)
[05:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:17] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268
[05:14:39] !log start of mwscript refreshImageMetadata.php --wiki=commonswiki --mediatype=OFFICE --batch-size=10 --verbose --mime="application/pdf" --force --sleep 5 on screen - It will take days / week to finish (T275268)
[05:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:46] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268
[06:13:24] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:17:16] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:37:16] (CR) JMeybohm: [C: +1] "I was under the impression that recreatePods does not work at all anymore. Glad that's not the case!" [deployment-charts] - https://gerrit.wikimedia.org/r/703848 (owner: Giuseppe Lavagetto)
[07:01:07] !log installing apache2 security updates
[07:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:28] (PS10) Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442)
[07:30:30] (PS1) Jcrespo: dbbackups: Reduce db1102 x1 memory usage [puppet] - https://gerrit.wikimedia.org/r/704064
[07:30:56] (PS2) Jcrespo: dbbackups: Reduce db1102 x1 memory usage [puppet] - https://gerrit.wikimedia.org/r/704064
[07:32:16] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (elukey) >>! In T286032#7197078, @MoritzMuehlenhoff wrote: > Looking at Ganeti VMs, they broadly fall under three/four categories: >...
[07:34:12] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:39:55] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (elukey)
[07:40:46] (CR) Jcrespo: [C: +2] dbbackups: Reduce db1102 x1 memory usage [puppet] - https://gerrit.wikimedia.org/r/704064 (owner: Jcrespo)
[07:41:56] (PS2) ArielGlenn: dumps: Migrate kiwix update cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/703470 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:43:07] (CR) ArielGlenn: [C: +2] dumps: Migrate kiwix update cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/703470 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:44:51] !log restart db1102:x1 mariadb instance
[07:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:16] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[07:56:32] (PS1) ArielGlenn: Add new XML/SQL dumps mirror (California, USA) [puppet] - https://gerrit.wikimedia.org/r/704065
[07:57:48] (CR) Ladsgroup: [C: +2] "deploying" [extensions/Wikibase] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/703890 (https://phabricator.wikimedia.org/T286193) (owner: Ladsgroup)
[07:59:37] (CR) ArielGlenn: [C: +2] Add new XML/SQL dumps mirror (California, USA) [puppet] - https://gerrit.wikimedia.org/r/704065 (owner: ArielGlenn)
[08:02:10] RECOVERY -
High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:04:03] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[08:04:15] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:04:25] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[08:04:33] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:04:45] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[08:04:51] SRE, Analytics, Tracking-Neverending: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (fgiunchedi)
[08:05:15] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:05:48] godog: is when there's no longer cron spam not the end? Hence not tracking-neverending
[08:07:50] RhinosF1: agreed, I created a subtask directly and thus the tag stuck, feel free to remove it though!
[08:08:30] godog: ty
[08:08:51] SRE, Analytics: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (RhinosF1)
[08:12:05] SRE, Machine-Learning-Team, serviceops, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) `base::expose_puppet_certs` is used in both master and node profiles, with different settings: * on master...
[08:12:58] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:13:15] SRE, Analytics: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (Ladsgroup) Is it coming from puppet? It should be migrated to systemd timer if that's the case: {T273673}
[08:14:01] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (ArielGlenn) With the new schedule I think I can swap one dumpsdata host and one snapshot host and avoid any impact whatsoever on XMl/SQL dumps....
[08:15:33] (CR) Ema: [C: +1] nc_redirects: Remove wikimedia.com rule [puppet] - https://gerrit.wikimedia.org/r/703910 (https://phabricator.wikimedia.org/T286377) (owner: Vgutierrez)
[08:15:55] (CR) Ema: [C: +1] acme-chief: Drop wikimedia.com related SNIs [puppet] - https://gerrit.wikimedia.org/r/703911 (https://phabricator.wikimedia.org/T286377) (owner: Vgutierrez)
[08:16:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:18:58] (03PS10) 10Vgutierrez: varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) [08:19:16] (03Merged) 10jenkins-bot: Remove subscribing to other aspect for entity usage [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703890 (https://phabricator.wikimedia.org/T286193) (owner: 10Ladsgroup) [08:19:35] (03CR) 10Vgutierrez: [C: 03+2] nc_redirects: Remove wikimedia.com rule [puppet] - 10https://gerrit.wikimedia.org/r/703910 (https://phabricator.wikimedia.org/T286377) (owner: 10Vgutierrez) [08:20:33] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30161/console" [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [08:21:01] 10SRE, 10SRE Observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) There are still upload errors from the compactor as noted above, especially when uploading large blocks. Not sure if related yet but I noticed that `swift-recon... 
[08:23:49] (03PS11) 10Vgutierrez: varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) [08:25:07] (03PS9) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [08:26:01] (03CR) 10Ema: [C: 03+1] varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [08:26:08] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/Wikibase/client: Backport: [[gerrit:703890|Remove subscribing to other aspect for entity usage (T286193)]] (duration: 00m 59s) [08:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:15] T286193: Drop subscribing to "other" aspect when using statements - https://phabricator.wikimedia.org/T286193 [08:30:50] (03CR) 10Vgutierrez: [C: 03+1] varnish: use 403 instead of 429 where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) (owner: 10Ema) [08:36:01] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Drop wikimedia.com related SNIs [puppet] - 10https://gerrit.wikimedia.org/r/703911 (https://phabricator.wikimedia.org/T286377) (owner: 10Vgutierrez) [08:37:10] (03CR) 10Ema: [C: 03+2] varnish: use 403 instead of 429 where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) (owner: 10Ema) [08:38:11] !log test a single frontend for thanos-swift / thanos-query to test "bad host" theory - T285835 [08:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:17] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 [08:41:17] 10SRE, 10Traffic, 10Patch-For-Review: LetsEncrypt cert expiration warning for some ncredir names - https://phabricator.wikimedia.org/T286377 (10Vgutierrez) 
05Open→03Resolved a:03Vgutierrez [08:43:23] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) After a chat with Janis we reviewed the master's code and found https://gerrit.wikimedia.org/r/c/operation... [08:43:29] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm) We also talked about using Istio Ingress in the past (envoy-based) which could be a good fit as well and we could share technology and resources with ML... [08:44:38] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:12] 10SRE, 10ops-codfw: mgmt on logstash2021 inaccessible - https://phabricator.wikimedia.org/T286274 (10MoritzMuehlenhoff) In fact this was flapping, could be related to T283582 [08:49:55] (03PS10) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [08:49:57] (03PS1) 10Elukey: role::ml_k8s::master: avoid exposing puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/704071 (https://phabricator.wikimedia.org/T285927) [08:51:46] (03PS1) 10Kormat: install_server: Set db1183 to be partitioned on install. [puppet] - 10https://gerrit.wikimedia.org/r/704072 (https://phabricator.wikimedia.org/T284622) [08:55:31] (03CR) 10Kormat: [C: 03+2] install_server: Set db1183 to be partitioned on install. 
[puppet] - 10https://gerrit.wikimedia.org/r/704072 (https://phabricator.wikimedia.org/T284622) (owner: 10Kormat) [08:55:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30165/console" [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [08:57:13] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::master: avoid exposing puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/704071 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [08:59:49] (03Abandoned) 10Elukey: profile::kubernetes::node: add hiera config to expose puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [09:00:05] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10aborrero) >>! In T286065#7194569, @Bstorm wrote: > @aborrero does cloudgw require manual failover? it doesn't require manual failover, but we could... 
[09:02:25] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Kormat) [09:07:15] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:07:17] !log repool thanos-fe2002 - T285835 [09:07:21] (03PS11) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [09:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:24] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 [09:07:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rdb1006.eqiad.wmnet [09:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:01] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:10:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE [09:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:02] 10SRE, 10DBA, 10Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Kormat) >>! In T285803#7203653, @RLazarus wrote: > The downtime cookbook uses [[https://gerrit.wikimedia.org/r/plugins/gitiles/operation... 
[09:12:14] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:12:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30167/console" [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [09:12:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE [09:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:33] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:14:03] (03CR) 10JMeybohm: [C: 03+2] dragonfly: Remove fetching of $docker_registry_fqdn cert [puppet] - 10https://gerrit.wikimedia.org/r/702979 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:14:59] (03PS3) 10JMeybohm: site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/702982 (https://phabricator.wikimedia.org/T286057) [09:15:06] (03CR) 10Dzahn: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/703795 (owner: 10Legoktm) [09:15:27] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:15:36] yamlyaml [09:15:55] apperently its all about yaml today :P [09:16:04] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:16:37] (03CR) 10Vgutierrez: [C: 03+2] varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [09:17:42] (03CR) 10JMeybohm: [C: 03+2] site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/702982 (https://phabricator.wikimedia.org/T286057) (owner: 10JMeybohm) [09:18:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1006.eqiad.wmnet [09:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:20] fancy, jbond [09:23:19] :) complete fluke mutante would need to google to find out what characxter combination i typed ;) [09:24:52] (03PS1) 10Jbond: hiera cloud pki: add cloud defaults [puppet] - 10https://gerrit.wikimedia.org/r/704077 [09:25:08] (03CR) 10Dzahn: [C: 03+2] "query tested" [puppet] - 10https://gerrit.wikimedia.org/r/703428 (https://phabricator.wikimedia.org/T286181) (owner: 10Aklapper) [09:25:38] jbond: lol, ok :) [09:29:38] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10MoritzMuehlenhoff) [09:30:00] PROBLEM - Thanos sidecar cannot connect to Prometheus on alert1001 is CRITICAL: cluster=prometheus instance=prometheus1003 job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts 
https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [09:30:25] (03CR) 10Jbond: [C: 03+2] hiera cloud pki: add cloud defaults [puppet] - 10https://gerrit.wikimedia.org/r/704077 (owner: 10Jbond) [09:31:00] yeah well I can't tell what chars those are either, little boxes with hex is what I get (missing unicode chars in the font) [09:31:17] the thanos-sidecar alert is me [09:31:44] RECOVERY - Thanos sidecar cannot connect to Prometheus on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [09:32:24] 10SRE, 10serviceops, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10Kormat) >>! In T285806#7191055, @wkandek wrote: > Thanks everybody for the feedback on the communications for the DC switchover process. We will spe... [09:32:52] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:36:24] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:36:55] 10SRE, 10SRE Observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) >>! In T285835#7193190, @fgiunchedi wrote: >>>! In T285835#7193102, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https:... [09:37:39] (03PS1) 10Kormat: Revert "install_server: Set db1183 to be partitioned on install." 
[puppet] - 10https://gerrit.wikimedia.org/r/703893 [09:39:02] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [09:39:33] (03PS2) 10Kormat: Revert "install_server: Set db1183 to be partitioned on install." [puppet] - 10https://gerrit.wikimedia.org/r/703893 [09:40:17] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Set db1183 to be partitioned on install." [puppet] - 10https://gerrit.wikimedia.org/r/703893 (owner: 10Kormat) [09:43:31] (03PS1) 10Kormat: db1183: Assign role, configure hiera values. [puppet] - 10https://gerrit.wikimedia.org/r/704079 (https://phabricator.wikimedia.org/T284622) [09:50:12] (03PS1) 10Jbond: control: drop dependency on ${shlibs:Depends} [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704082 [09:50:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rdb1010.eqiad.wmnet [09:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] (03CR) 10Kormat: [C: 03+2] db1183: Assign role, configure hiera values. 
[puppet] - 10https://gerrit.wikimedia.org/r/704079 (https://phabricator.wikimedia.org/T284622) (owner: 10Kormat) [09:50:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704082 (owner: 10Jbond) [09:51:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] control: drop dependency on ${shlibs:Depends} [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704082 (owner: 10Jbond) [09:52:29] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [09:53:23] (03PS1) 10Jbond: changlog: bump version [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704083 [09:53:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1010.eqiad.wmnet [09:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] changlog: bump version [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704083 (owner: 10Jbond) [09:59:41] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [10:01:34] !log test thanos-compact upload with smaller part size - T285835 [10:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:41] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 [10:02:12] 10Puppet, 10SRE, 10Icinga, 10Infrastructure-Foundations, and 2 others: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (10jbond) 05Open→03Resolved a:03jbond being bold and closing this based on last comment [10:02:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rdb1009.eqiad.wmnet [10:02:52] PROBLEM - Host ml-serve-ctrl1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:02:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:39] !log depool mw2383 [10:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:25] ml-serve-ctrl is me :) [10:05:18] !log planet - deleting state files, manually running update for all 161 en feeds - T285251 [10:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:24] T285251: techblog posts not appearing on Wikimedia Planet - https://phabricator.wikimedia.org/T285251 [10:05:26] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:28] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:05:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet.refresh_certs: don't fail if resources changed [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:06:14] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:07:06] RECOVERY - Host ml-serve-ctrl1001 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [10:07:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I lost track of how much time I spent on the puppet dancing this is automating. 
I even created a script myself to automate the same, but i" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:07:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1009.eqiad.wmnet [10:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: ran black and isort [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702083 (owner: 10David Caro) [10:08:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: add default control node to openstack api [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702084 (owner: 10David Caro) [10:08:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: namespace exceptions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702085 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:10:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: quote some parameters to openstack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702086 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:10:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.OpenstackApi: allow soft affinities to be specified [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702087 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:11:13] !log add 10g disk to ml-serve-ctrl[12]00[12] for T285927 [10:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:19] T285927: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 [10:12:02] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:12:10] 10SRE, 10vm-requests: eqiad: 1 VM request for Dragonfly supernode 
- https://phabricator.wikimedia.org/T286057 (10JMeybohm) 05Open→03Resolved [10:12:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.start_instance_with_prefix: allow passing the affinity (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:14:16] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:16:37] (03CR) 10Jbond: [C: 03+1] puppet.refresh_certs: don't fail if resources changed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:18:11] (03CR) 10Volans: puppet.refresh_certs: don't fail if resources changed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:21:50] (03PS1) 10Jbond: control: drop dependency on ${shlibs:Depends} [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704087 [10:22:18] (03CR) 10Jbond: [V: 03+2 C: 03+2] control: drop dependency on ${shlibs:Depends} [debs/cfssl] - 10https://gerrit.wikimedia.org/r/704087 (owner: 10Jbond) [10:23:28] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:24:47] (03PS3) 10David Caro: puppet.refresh_certs: don't rely on puppet return code [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 [10:24:49] (03CR) 10David Caro: puppet.refresh_certs: don't rely on puppet return code (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:25:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:25:40] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:26:44] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [10:27:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" (036 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:28:10] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:29:21] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to //cl... [10:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T1030). 
[10:30:08] !log installing apache updates on an-tool* hosts (affects Turnilo, Yarn, Superset, Hue) briefly [10:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:11] (03PS3) 10David Caro: wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) [10:31:13] (03CR) 10David Caro: wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:31:15] (03PS3) 10David Caro: wmcs: ran black and isort [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702083 [10:31:17] (03PS3) 10David Caro: wmcs: add default control node to openstack api [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702084 [10:31:19] (03PS3) 10David Caro: wmcs: namespace exceptions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702085 (https://phabricator.wikimedia.org/T274498) [10:31:21] (03PS3) 10David Caro: wmcs: quote some parameters to openstack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702086 (https://phabricator.wikimedia.org/T274498) [10:31:23] (03PS3) 10David Caro: wmcs.OpenstackApi: allow soft affinities to be specified [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702087 (https://phabricator.wikimedia.org/T274498) [10:31:25] (03PS3) 10David Caro: wmcs.start_instance_with_prefix: allow passing the affinity [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) [10:31:27] (03PS3) 10David Caro: wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) [10:31:29] (03PS3) 10David Caro: wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 
(https://phabricator.wikimedia.org/T274498) [10:31:31] (03PS3) 10David Caro: wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) [10:31:33] (03PS3) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) [10:31:35] (03PS3) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) [10:31:47] (03CR) 10David Caro: [C: 03+2] puppet.refresh_certs: don't rely on puppet return code [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:32:16] (03CR) 10Majavah: wmcs: add kubernetes and kubeadm controllers (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:32:37] (03CR) 10David Caro: wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:33:20] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:33:53] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) @Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cl... 
[10:34:31] (03CR) 10David Caro: wmcs.start_instance_with_prefix: allow passing the affinity (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:35:14] (03PS6) 10Hnowlan: maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [10:36:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @Cmjohnson @Jclark-ctr We would like to start putting those servers in production, is it possible to update or complete any actions... [10:37:53] (03Merged) 10jenkins-bot: puppet.refresh_certs: don't rely on puppet return code [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [10:39:05] (03PS4) 10David Caro: wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) [10:39:07] (03CR) 10David Caro: wmcs: add kubernetes and kubeadm controllers (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:39:09] (03PS4) 10David Caro: wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) [10:39:11] (03PS4) 10David Caro: wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) [10:39:13] (03PS4) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) [10:39:15] (03PS4) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 
(https://phabricator.wikimedia.org/T285858) [10:39:17] (03CR) 10Arturo Borrero Gonzalez: wmcs.toolforge: add k8s worker add/remove cookbooks (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:40:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:43:48] 10SRE, 10DBA, 10Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Volans) If I may add my 2 cents, `icinga-downtime` is a very simple and old bash script that only ensures that a given host is defined i... [10:44:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.ceph: add cookbook to bootstrap and add OSDs (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:45:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:46:10] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:46:32] (03CR) 10Elukey: [V: 03+1 C: 03+2] ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:48:14] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01498 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:55:23] (03PS2) 10Zabe: Add 
'editautoreviewprotected' protection level to hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702761 (https://phabricator.wikimedia.org/T275076) [10:57:01] (03PS3) 10WMDE-Fisch: Enable template search improvements on first wikis 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) [10:57:08] (03PS7) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [10:57:10] (03PS3) 10WMDE-Fisch: Enable template search improvements on first wikis 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) [10:58:33] (03PS1) 10Elukey: role::ml_k8s::master: add docker profiles [puppet] - 10https://gerrit.wikimedia.org/r/704088 (https://phabricator.wikimedia.org/T285927) [10:58:57] !log testing a depool of maps2008 to ensure kartotherian load can cope with one less node [10:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T1100). [11:00:05] WMDE-Fisch and Zabe: A patch you scheduled for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] \o [11:00:16] o/ [11:00:24] I can deploy today. [11:00:40] I'll self serve :-) [11:00:42] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:01:07] If that's fine ;-) [11:01:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet [11:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:27] WMDE-Fisch: Sure, go ahead. [11:02:36] (03CR) 10WMDE-Fisch: [C: 03+2] Always add 1 prefixsearch match when searching for templates [extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [11:02:42] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [11:03:24] RECOVERY - Disk space on backup2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2003&var-datasource=codfw+prometheus/ops [11:04:07] (03PS2) 10WMDE-Fisch: Enable transclusion back button on first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703568 (https://phabricator.wikimedia.org/T284553) [11:04:22] o/ [11:04:28] (I’m late ^^) [11:04:32] (03PS1) 10Urbanecm: Revert "ptwiki: Use celebration logos in new vector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703894 (https://phabricator.wikimedia.org/T281925) [11:04:35] Just will do this one config change before the others. It's independent of the backport. [11:04:41] (03PS1) 10Urbanecm: Revert "Use ptwiki 20th anniversary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703895 (https://phabricator.wikimedia.org/T281925) [11:05:03] hm, what happened to the jouncebot message? 
there’s a weird block in the middle of it [11:05:22] (03PS5) 10David Caro: wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) [11:05:24] (03CR) 10David Caro: wmcs: add kubernetes and kubeadm controllers (036 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [11:05:26] (03PS5) 10David Caro: wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) [11:05:28] (03CR) 10David Caro: wmcs.toolforge: add k8s worker add/remove cookbooks (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [11:05:30] (03PS5) 10David Caro: wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) [11:05:32] (03PS5) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) [11:05:34] (03PS5) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) [11:05:36] (03CR) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [11:05:39] (03PS2) 10Elukey: role::ml_k8s::master: add docker profiles [puppet] - 10https://gerrit.wikimedia.org/r/704088 (https://phabricator.wikimedia.org/T285927) [11:05:43] Lucas_WMDE: there's an additional block in the deployments calendar, maybe that? 
[11:05:54] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703568 (https://phabricator.wikimedia.org/T284553) (owner: 10WMDE-Fisch) [11:05:57] yeah, probably [11:06:26] WMDE-Fisch: if the other config patches from you are backport-dependent, would you mind shipping zabe's patch before jenkins merges the backport? Just to save a bit time :) [11:06:35] sure [11:06:42] thx [11:07:27] (03Merged) 10jenkins-bot: Enable transclusion back button on first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703568 (https://phabricator.wikimedia.org/T284553) (owner: 10WMDE-Fisch) [11:07:31] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30169/console" [puppet] - 10https://gerrit.wikimedia.org/r/704088 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [11:08:01] and i also added two very-last minute patches, which can be also shipped pre-backport, if time permits [11:08:18] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::master: add docker profiles [puppet] - 10https://gerrit.wikimedia.org/r/704088 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [11:10:10] (03PS2) 10Urbanecm: Revert "ptwiki: Use celebration logos in new vector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703894 (https://phabricator.wikimedia.org/T286380) [11:10:12] (03PS2) 10Urbanecm: Revert "Use ptwiki 20th anniversary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703895 (https://phabricator.wikimedia.org/T286380) [11:10:19] (03PS3) 10Urbanecm: Revert "ptwiki: Use celebration logos in new vector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703894 (https://phabricator.wikimedia.org/T286380) [11:10:21] (03PS3) 10Urbanecm: Revert "Use ptwiki 20th anniversary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703895 (https://phabricator.wikimedia.org/T286380) [11:11:06] the VisualEditor change is failing 
in Zuul [11:12:03] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:703568|Enable transclusion back button on first wikis (T284553)]] (duration: 00m 58s) [11:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:11] T284553: Deploy template search improvements, back button+warning message, and delete button to small set of wikis - https://phabricator.wikimedia.org/T284553 [11:12:57] Lucas_WMDE: Error looks a bit weird [11:13:07] probably just a random browser test failure [11:13:23] zabe: I'll do yours next [11:13:41] (03PS3) 10WMDE-Fisch: Add 'editautoreviewprotected' protection level to hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702761 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [11:14:14] Lucas_WMDE: Was there a command for "resubmit"? I always forget [11:14:33] any comment with +2 CR should do it [11:14:38] assuming the previous build finished failing [11:14:57] you might want to cancel any still-running jobs in Jenkins first [11:15:43] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702761 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [11:16:06] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review, 10User-jbond: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10jbond) [11:16:16] WMDE-Fisch: https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php72-docker/69288/ is still running [11:16:40] (03Merged) 10jenkins-bot: Add 'editautoreviewprotected' protection level to hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702761 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [11:17:55] Lucas_WMDE: zabe 's config patch is unrelated [11:18:00] Or what do you mean? 
[11:18:19] I mean that Zuul won’t do anything about your “Deploy” comment, because the previous build is still running [11:18:22] (I think) [11:18:36] so you need to either wait for that to finish, or cancel it (log into Jenkins and find the red x button) [11:18:49] (03CR) 10jerkins-bot: [V: 04-1] Always add 1 prefixsearch match when searching for templates [extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [11:18:52] and then you can comment +2 to kick off another gate-and-submit [11:18:56] ok now it finished on its own ^ [11:19:17] !log testing a depool of maps2010 to ensure kartotherian load can cope with two less nodes [11:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:50] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:20:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet [11:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:43] zabe: please test on debug1002 [11:21:18] Lucas_WMDE: looks all fine to me and gate and submit finished for the other patch [11:21:26] or did I miss something? [11:21:29] WMDE-Fisch: shouldn't that be mwdebug2* due to the switchover? 
[11:22:00] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy 2nd attempt" [extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [11:22:08] ^ that was what I meant :) [11:22:23] now it’s running again [11:22:29] yeah sorry mwdebug [11:22:41] hmm no wait [11:22:52] majavah: 1002 worked for me before [11:22:59] 1002 works for me [11:23:01] sorry I guess I missed that [11:23:21] WMDE-Fisch: mwdebug1002 will work, but only as a RO server [11:23:34] (as eqiad's master DB is RO) [11:23:35] ahhh [11:23:56] sorry for that, then I guess we're lucky that these two did not need write [11:24:07] if they needed, MW will complain anyway :) [11:24:08] 10SRE, 10Traffic: Enable UDS support on varnish - https://phabricator.wikimedia.org/T285374 (10Vgutierrez) 05Open→03Resolved [11:24:14] 10SRE, 10Traffic: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [11:24:39] WMDE-Fisch: my config patch works the supposed way [11:25:17] zabe: Thanks deploying now [11:26:18] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:702761|Add 'editautoreviewprotected' protection level to hewikisource (T275076)]] (duration: 00m 57s) [11:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:24] T275076: Adding protection levels in hewikisource - https://phabricator.wikimedia.org/T275076 [11:27:13] zabe: done [11:27:19] thanks :) [11:27:20] (03CR) 10Hnowlan: [C: 03+1] Use IcingaHosts instead of Icinga (various) [cookbooks] - 10https://gerrit.wikimedia.org/r/702885 (owner: 10Volans) [11:28:23] urbanecm: can do yours next [11:28:29] cool [11:28:34] will you deploy, or should i? 
[11:28:41] you go :-) [11:28:45] ok [11:28:55] (03PS4) 10Urbanecm: Revert "ptwiki: Use celebration logos in new vector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703894 (https://phabricator.wikimedia.org/T286380) [11:29:00] (03CR) 10Urbanecm: [C: 03+2] Revert "ptwiki: Use celebration logos in new vector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703894 (https://phabricator.wikimedia.org/T286380) (owner: 10Urbanecm) [11:29:08] (03PS4) 10Urbanecm: Revert "Use ptwiki 20th anniversary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703895 (https://phabricator.wikimedia.org/T286380) [11:29:11] (03CR) 10Urbanecm: [C: 03+2] Revert "Use ptwiki 20th anniversary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703895 (https://phabricator.wikimedia.org/T286380) (owner: 10Urbanecm) [11:29:46] (03Merged) 10jenkins-bot: Revert "ptwiki: Use celebration logos in new vector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703894 (https://phabricator.wikimedia.org/T286380) (owner: 10Urbanecm) [11:29:54] (03Merged) 10jenkins-bot: Revert "Use ptwiki 20th anniversary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703895 (https://phabricator.wikimedia.org/T286380) (owner: 10Urbanecm) [11:33:28] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cd5f5375b4f712c56e9396cc550078272ef668de: Revert "ptwiki: Use celebration logos in new vector" (T286380) (duration: 00m 57s) [11:33:30] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: name=maps2001.codfw.wmnet [11:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:36] T286380: Revert ptwiki logo temporarily changed - https://phabricator.wikimedia.org/T286380 [11:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:22] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: name=maps2003.codfw.wmnet [11:34:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:56] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 773c956811cba5c3a2cbba32bc1e1a536dbd9f0b: Revert "Use ptwiki 20th anniversary logos" (T286380) (duration: 00m 57s) [11:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:06] WMDE-Fisch: i'm done [11:35:13] urbanecm: thanks [11:35:22] * WMDE-Fisch waiting for jenkins [11:35:44] * urbanecm leaves WMDE-Fisch in the jenkins waiting loop :) [11:35:49] :-) [11:37:10] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) (owner: 10RLazarus) [11:37:18] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: name=maps2004.codfw.wmnet [11:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:56] !log adjusting weights of codfw maps servers to reduce load on older spec machines [11:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:07] !log installing apache updates on mw1/eqiad hosts [11:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:50] (03Merged) 10jenkins-bot: Always add 1 prefixsearch match when searching for templates [extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [11:43:54] \o/ [11:45:55] !log adjusting weights of eqiad maps servers to reduce load on older spec machines [11:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:26] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: name=maps100[1-4].eqiad.wmnet [11:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:02] (03PS4) 10WMDE-Fisch: Enable template search improvements on first wikis 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703566 
(https://phabricator.wikimedia.org/T284553) [11:49:10] (03PS4) 10WMDE-Fisch: Enable template search improvements on first wikis 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) [11:49:31] !log wmde-fisch@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/VisualEditor/modules/ve-mw/ui/widgets/ve.ui.MWTemplateTitleInputWidget.js: Backport: [[gerrit:703649|Always add 1 prefixsearch match when searching for templates]] (duration: 00m 57s) [11:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:24] (03CR) 10Arturo Borrero Gonzalez: metricsinfra: Add HAProxy for distributing http traffic (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [11:50:40] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) (owner: 10WMDE-Fisch) [11:52:10] (03Merged) 10jenkins-bot: Enable template search improvements on first wikis 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) (owner: 10WMDE-Fisch) [11:53:27] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: 10WMDE-Fisch) [11:54:07] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:703566|Enable template search improvements on first wikis 1/2 (T284553)]] (duration: 00m 56s) [11:54:08] (03Merged) 10jenkins-bot: Enable template search improvements on first wikis 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: 10WMDE-Fisch) [11:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:14] T284553: Deploy template search improvements, back button+warning message, and delete button 
to small set of wikis - https://phabricator.wikimedia.org/T284553 [11:58:28] !log wmde-fisch@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:703567|Enable template search improvements on first wikis 2/2 (T284553)]] (duration: 00m 57s) [11:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:11] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: don't deploy alerts to 'global' instance by default [puppet] - 10https://gerrit.wikimedia.org/r/702599 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [11:59:22] (03PS2) 10Filippo Giunchedi: prometheus: don't deploy alerts to 'global' instance by default [puppet] - 10https://gerrit.wikimedia.org/r/702599 (https://phabricator.wikimedia.org/T284810) [12:00:07] EU deploys done [12:00:12] Right on time [12:07:15] (03PS1) 10Filippo Giunchedi: hieradata: add pki to monitoring::groups [puppet] - 10https://gerrit.wikimedia.org/r/704101 [12:09:05] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add pki to monitoring::groups [puppet] - 10https://gerrit.wikimedia.org/r/704101 (owner: 10Filippo Giunchedi) [12:21:57] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [cookbooks] - 10https://gerrit.wikimedia.org/r/702886 (owner: 10Volans) [12:29:24] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [12:29:26] (03PS1) 10Jelto: role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) [12:29:36] (03PS2) 10Volans: Use IcingaHosts instead of Icinga (search) [cookbooks] - 10https://gerrit.wikimedia.org/r/702884 [12:29:38] (03PS2) 10Volans: Use IcingaHosts instead of Icinga (various) [cookbooks] - 10https://gerrit.wikimedia.org/r/702885 [12:29:40] (03PS3) 10Volans: iUse IcingaHosts instead of Icinga (generic) [cookbooks] - 
10https://gerrit.wikimedia.org/r/702886 [12:29:42] (03CR) 10Volans: "replies inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/702884 (owner: 10Volans) [12:32:18] (03CR) 10Jelto: "Could you please take a look? I changed 5 of the new appservers to canary and recreated the configuration of the old appservers." [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [12:32:19] jelto: why not ^mw14(19|2[0-1])\.eqiad\.wmnet$ instead of ^mw14(1[9]|2[0-1])\.eqiad\.wmnet$ in site.pp [12:34:00] (03PS2) 10Jelto: role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) [12:34:31] !log hnowlan@puppetmaster1001 conftool action : set/weight=5; selector: name=maps2004.codfw.wmnet [12:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:37] (03PS1) 10Elukey: Add ml-serve-ctrl* nodes to the k8s ML iBGP configs [homer/public] - 10https://gerrit.wikimedia.org/r/704104 (https://phabricator.wikimedia.org/T285927) [12:34:38] RhinosF1: thanks, I cleaned up the regex [12:34:54] jelto: ty [12:42:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Volans) There are some issues with this task: 1. The hosts have been provisioned on Netbox with public IPs, see https://netbox.wikimedia.org/ipam/ip-addresses/?q=pc101 {F34548558} 1.... 
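(Editor's note on the regex exchange above: the two patterns quoted from site.pp should match exactly the same hosts, since the character class `[9]` matches only the single character `9`. A minimal sketch to check this, assuming Python's `re` module behaves like Puppet's Ruby-style regexes for a pattern this simple; the helper name `matching_hosts` and the candidate range 00..99 are illustrative only and do not come from the patch.)

```python
import re

# Patterns quoted from the review comment (node regexes in site.pp).
ORIGINAL = re.compile(r"^mw14(1[9]|2[0-1])\.eqiad\.wmnet$")
SIMPLIFIED = re.compile(r"^mw14(19|2[0-1])\.eqiad\.wmnet$")


def matching_hosts(pattern):
    """Return the set of mw14NN hostnames (NN = 00..99) that the pattern matches.

    Hypothetical helper: it only enumerates candidate names to compare the
    two patterns, it does not query any real inventory.
    """
    candidates = (f"mw14{n:02d}.eqiad.wmnet" for n in range(100))
    return {host for host in candidates if pattern.match(host)}


original_hosts = matching_hosts(ORIGINAL)
simplified_hosts = matching_hosts(SIMPLIFIED)

print(sorted(simplified_hosts))
print(original_hosts == simplified_hosts)
```

Both patterns select mw1419, mw1420 and mw1421 only, so the cleanup is purely cosmetic. `2[0-1]` could be written as `2[01]` with the same meaning.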
[12:42:20] !log reverting Primary IP allocation for pc1011-1014, leaving only mgmt IPs - T282484 [12:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:27] T282484: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 [12:48:20] !log volans@cumin2002 START - Cookbook sre.dns.netbox [12:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:38] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Volans) I've updated the previous message with the homer part too. Given that we have the pending spurious changes that will open the ports for those hosts with the wrong vlan I'm reve... [13:11:12] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:21:11] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key for Aisha Khatun - https://phabricator.wikimedia.org/T286410 (10Ottomata) No approval needed from me for this one! :) [13:21:30] (03PS1) 10Muehlenhoff: Remove ldap-replica1001/1002/2003/2004 from conf-tool [puppet] - 10https://gerrit.wikimedia.org/r/704113 [13:23:25] (03CR) 10Dzahn: [C: 03+1] "yes, we talked about this. 
the wancache config is identical on canaries so it makes it much easier to read to put this into common hierada" [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:32:55] !log otto@deploy1002 Started deploy [analytics/refinery@1cb9e12]: Add event_default gobblin job - T271232 [13:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:01] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [13:34:50] (03PS1) 10Zabe: Removed unused celebration logos and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704115 (https://phabricator.wikimedia.org/T286380) [13:35:39] (03CR) 10Lucas Werkmeister (WMDE): "I don’t understand why T285098 says that we shouldn’t enable the A/B testing yet. As far as I can tell, the config variable to enable it w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [13:36:32] !log otto@deploy1002 Finished deploy [analytics/refinery@1cb9e12]: Add event_default gobblin job - T271232 (duration: 03m 37s) [13:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:07] (03CR) 10Ottomata: [C: 03+2] Add gobblin job event_default [puppet] - 10https://gerrit.wikimedia.org/r/703867 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:37:48] (03PS2) 10Ottomata: Finalze gobblin event migration [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) [13:38:16] (03CR) 10jerkins-bot: [V: 04-1] Finalze gobblin event migration [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:38:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove ldap-replica1001/1002/2003/2004 from conf-tool [puppet] - 10https://gerrit.wikimedia.org/r/704113 (owner: 10Muehlenhoff) [13:39:27] (03CR) 10Filippo Giunchedi: "LGTM, see inline and 
thank you!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703741 (owner: 10Jbond) [13:43:34] (03CR) 10RLazarus: [C: 03+2] dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) (owner: 10RLazarus) [13:43:55] (03PS3) 10RLazarus: dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) [13:44:03] (03CR) 10Zabe: [C: 03+1] Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [13:44:16] (03CR) 10Jelto: "@Effie @Joe could you also take a quick look? I replicated the configuration of the old canary app servers on mw1414-mw1418. The old canar" [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:44:24] (03PS5) 10Zabe: Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [13:44:26] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [13:44:40] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 (owner: 10Jbond) [13:45:01] (03PS1) 10Ssingh: wikidough: use the correct dnsdist version in dnsdist.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/704116 [13:45:37] (03CR) 10Filippo Giunchedi: [C: 03+1] R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [13:46:59] (03PS2) 10Zabe: Remove unused celebration logos and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704115 (https://phabricator.wikimedia.org/T286380) [13:48:54] (03CR) 10Effie Mouzeli: [C: 04-1] role::common::mediawiki::canary_appserver add new canary app server in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:49:28] !log otto@deploy1002 Started deploy [analytics/refinery@0149c81]: Set event_default gobblin job max mappers=128 - T271232 [13:49:33] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/30173/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/704116 (owner: 10Ssingh) [13:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:34] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [13:49:35] (03PS3) 10Ottomata: Finalze gobblin event migration [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) [13:49:37] (03CR) 10Ssingh: [C: 03+2] wikidough: use the correct dnsdist version in dnsdist.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/704116 (owner: 10Ssingh) [13:52:44] !log otto@deploy1002 Finished deploy [analytics/refinery@0149c81]: Set 
event_default gobblin job max mappers=128 - T271232 (duration: 03m 16s) [13:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:01] !log otto@deploy1002 Started deploy [analytics/refinery@dd65f38]: event_default gobblin job - fix typo - T271232 [13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:07] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [13:56:42] (03CR) 10Volans: [C: 04-1] "Small typo to fix and a couple of nitpicks. LGTM otherwise." (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [13:59:31] !log otto@deploy1002 Finished deploy [analytics/refinery@dd65f38]: event_default gobblin job - fix typo - T271232 (duration: 03m 30s) [13:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:07] (03CR) 10Klausman: [C: 03+1] Add ml-serve-ctrl* nodes to the k8s ML iBGP configs [homer/public] - 10https://gerrit.wikimedia.org/r/704104 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [14:01:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet [14:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:13] (03CR) 10RLazarus: dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) (owner: 10RLazarus) [14:03:17] (03CR) 10RLazarus: [C: 03+2] dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) (owner: 10RLazarus) [14:04:15] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key for Aisha Khatun - https://phabricator.wikimedia.org/T286410 (10CBogen) Don't know if approval from me is needed but if so, here it is :) Any chance we can get this resolved today? 
Aisha needs this for a demo she's doing tomorrow. Thanks! [14:09:20] (03Merged) 10jenkins-bot: dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) (owner: 10RLazarus) [14:11:19] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" [cookbooks] - 10https://gerrit.wikimedia.org/r/702886 (owner: 10Volans) [14:12:07] (03CR) 10Klausman: [C: 03+1] Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [14:12:45] (03CR) 10Klausman: [C: 03+1] knative,kubeflow: improve the import of the build images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/701396 (owner: 10Elukey) [14:12:48] (03PS4) 10Volans: Use IcingaHosts instead of Icinga (generic) [cookbooks] - 10https://gerrit.wikimedia.org/r/702886 [14:13:03] (03PS1) 10Muehlenhoff: Remove remaining LDAP references for ldap-replica1001/1002/2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/704122 [14:14:16] (03PS2) 10Muehlenhoff: Remove remaining references for ldap-replica1001/1002/2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/704122 [14:14:20] (03PS3) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703741 [14:15:37] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica2003.wikimedia.org [14:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:29] (03PS1) 10Ssingh: acme_chief: remove malmok's SNI and host from Wikidough certs [puppet] - 10https://gerrit.wikimedia.org/r/704125 (https://phabricator.wikimedia.org/T286480) [14:25:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica2003.wikimedia.org [14:25:07] 10SRE, 10vm-requests: eqiad/codfw: 4 VMs requested for LDAP replicas - https://phabricator.wikimedia.org/T281089 (10ops-monitoring-bot) 
cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica2003.wikimedia.org` - ldap-replica2003.wikimedia.org (**PASS**) - Downtimed host on Icinga... [14:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:43] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica2004.wikimedia.org [14:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:00] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30174/console" [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:34:42] (03PS10) 10Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 [14:34:50] (03CR) 10Jbond: "updated thanks" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [14:36:30] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [14:37:47] (03CR) 10jerkins-bot: [V: 04-1] sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [14:40:07] (03CR) 10Jforrester: "Thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212) (owner: 10Legoktm) [14:40:14] (03CR) 10JMeybohm: [C: 03+1] Add ml-serve-ctrl* nodes to the k8s ML iBGP configs [homer/public] - 10https://gerrit.wikimedia.org/r/704104 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [14:42:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica2004.wikimedia.org [14:42:53] 10SRE, 10vm-requests: eqiad/codfw: 4 VMs requested for LDAP replicas - https://phabricator.wikimedia.org/T281089 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica2004.wikimedia.org` - ldap-replica2004.wikimedia.org (**PASS**) - Downtimed host on Icinga... [14:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:50] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica1001.wikimedia.org [14:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:57] (03PS11) 10Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 [14:56:50] (03CR) 10Jbond: P:prometheus::ops: manage target_path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703741 (owner: 10Jbond) [14:57:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] R:prometheus::cluster_config: Drop ERB file [puppet] - 10https://gerrit.wikimedia.org/r/703732 (owner: 10Jbond) [14:57:11] (03CR) 10Jbond: [C: 03+2] P:prometheus::ops: drop the site parameter [puppet] - 10https://gerrit.wikimedia.org/r/703740 (owner: 10Jbond) [14:57:21] (03PS1) 10Muehlenhoff: Extend access for S&F contractors [puppet] - 10https://gerrit.wikimedia.org/r/704126 [14:58:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica1001.wikimedia.org [14:58:04] 10SRE, 10vm-requests: eqiad/codfw: 4 VMs requested for LDAP replicas 
- https://phabricator.wikimedia.org/T281089 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica1001.wikimedia.org` - ldap-replica1001.wikimedia.org (**PASS**) - Downtimed host on Icinga... [14:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: Deploying schema change T277116 [14:58:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: Deploying schema change T277116 [14:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:13] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [14:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:27] (03PS4) 10Jbond: P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703741 [15:00:00] (03CR) 10Jbond: [C: 03+2] P:prometheus::ops: manage target_path [puppet] - 10https://gerrit.wikimedia.org/r/703741 (owner: 10Jbond) [15:00:34] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-replica1002.wikimedia.org [15:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10RobH) [15:03:15] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [15:08:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T277116 [15:08:21] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T277116 
[15:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:22] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [15:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-replica1002.wikimedia.org [15:10:55] 10SRE, 10vm-requests: eqiad/codfw: 4 VMs requested for LDAP replicas - https://phabricator.wikimedia.org/T281089 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-replica1002.wikimedia.org` - ldap-replica1002.wikimedia.org (**PASS**) - Downtimed host on Icinga... [15:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:12] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, ok to merge any time." [homer/public] - 10https://gerrit.wikimedia.org/r/704104 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:15:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T277116 [15:15:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T277116 [15:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:01] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [15:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10jijiki) @cmooney should we send out an email about this to ops@ and possibly add those times/dates to the maintenance 
calendar? Thank you! [15:19:20] 10SRE, 10Datacenter-Switchover, 10SRE Observability (FY2021/2022-Q1): Switchover thanos-query and thanos-swift services as part of DC switchover - https://phabricator.wikimedia.org/T285273 (10lmata) [15:19:43] 10SRE, 10Datacenter-Switchover, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Switchover thanos-query and thanos-swift services as part of DC switchover - https://phabricator.wikimedia.org/T285273 (10fgiunchedi) [15:20:07] 10SRE, 10vm-requests: eqiad/codfw: 4 VMs requested for LDAP replicas - https://phabricator.wikimedia.org/T281089 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The new VMs are in use for quite a while now. [15:20:18] 10SRE, 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q1): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) [15:20:21] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for S&F contractors [puppet] - 10https://gerrit.wikimedia.org/r/704126 (owner: 10Muehlenhoff) [15:20:59] 10SRE, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10lmata) [15:22:21] (03CR) 10Muehlenhoff: [C: 03+2] Remove remaining references for ldap-replica1001/1002/2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/704122 (owner: 10Muehlenhoff) [15:23:58] (03CR) 10Elukey: [C: 03+2] Add ml-serve-ctrl* nodes to the k8s ML iBGP configs [homer/public] - 10https://gerrit.wikimedia.org/r/704104 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:24:28] !log expand ML k8s iBGP neighbors to include the master nodes (ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104) [15:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:21] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer 
re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10hnowlan) [15:28:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T277116 [15:28:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T277116 [15:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:31] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [15:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 18 hosts with reason: Deploying schema change to s4 T277116 [15:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 18 hosts with reason: Deploying schema change to s4 T277116 [15:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:15] (03CR) 10Jbond: [C: 03+2] sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [15:32:20] (03PS12) 10Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 [15:33:23] 10SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (10dancy) Thanks @MoritzMuehlenhoff ! [15:33:42] (03CR) 10Urbanecm: [C: 04-1] "a typo (probably doesn't affect functionality)" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [15:38:03] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Ottomata) Hiya, checking in! 
We'd love to move on {T275767}, any new ETA? Thanks! [15:42:33] (03PS1) 10Elukey: Update iBGP neighbor list for the ML k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/704131 (https://phabricator.wikimedia.org/T285927) [15:45:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30175/console" [puppet] - 10https://gerrit.wikimedia.org/r/704131 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:45:41] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff) [15:46:00] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10MoritzMuehlenhoff) [15:46:54] (03CR) 10Elukey: [V: 03+1 C: 03+2] Update iBGP neighbor list for the ML k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/704131 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:51:09] (03PS1) 10Elukey: Add k8s iBGP neighbor config to the ML k8s master nodes [puppet] - 10https://gerrit.wikimedia.org/r/704132 (https://phabricator.wikimedia.org/T285927) [15:53:22] (03PS23) 10DCausse: rdf-streaming-updater: switch to H/A session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [15:53:24] (03PS8) 10DCausse: Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/693411 [15:53:26] (03PS7) 10DCausse: rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 [15:54:03] !log ppchelko@deploy1002 Started deploy [restbase/deploy@b05ade3]: Add newly created wikis T284929 T284457 T284392 [15:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:11] T284929: Add 
shiwiki to RESTBase - https://phabricator.wikimedia.org/T284929 [15:54:11] T284392: Add banwikisource to RESTBase - https://phabricator.wikimedia.org/T284392 [15:54:11] T284457: Add dagwiki to RESTBase - https://phabricator.wikimedia.org/T284457 [15:54:13] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30177/console" [puppet] - 10https://gerrit.wikimedia.org/r/704132 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:54:59] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 (owner: 10DCausse) [15:55:51] (03PS2) 10Elukey: Add k8s iBGP neighbor config to the ML k8s master nodes [puppet] - 10https://gerrit.wikimedia.org/r/704132 (https://phabricator.wikimedia.org/T285927) [15:56:43] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30178/console" [puppet] - 10https://gerrit.wikimedia.org/r/704132 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:59:04] (03PS1) 10Jbond: statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) [15:59:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add k8s iBGP neighbor config to the ML k8s master nodes [puppet] - 10https://gerrit.wikimedia.org/r/704132 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [16:00:12] (03CR) 10Jbond: "Results of built package" [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:00:23] (03CR) 10jerkins-bot: [V: 04-1] statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:01:04] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Deploying schema change to s4 T277116 - extending downtime [16:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Deploying schema change to s4 T277116 - extending downtime [16:01:12] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [16:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:13] (03PS2) 10Jbond: statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) [16:08:56] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10sbassett) >>! In T257066#7202898, @Legoktm wrote: > My tentative plan is to re-enable Score on test.wikipedi... 
[16:15:27] !log ppchelko@deploy1002 Finished deploy [restbase/deploy@b05ade3]: Add newly created wikis T284929 T284457 T284392 (duration: 21m 24s) [16:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:35] T284929: Add shiwiki to RESTBase - https://phabricator.wikimedia.org/T284929 [16:15:35] T284392: Add banwikisource to RESTBase - https://phabricator.wikimedia.org/T284392 [16:15:36] T284457: Add dagwiki to RESTBase - https://phabricator.wikimedia.org/T284457 [16:22:44] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704136 [16:27:06] (03PS1) 10Daimona Eaytoy: Avoid passing invalid offset to mb_strpos [extensions/AbuseFilter] (wmf/1.37.0-wmf.13) - 10https://gerrit.wikimedia.org/r/703902 (https://phabricator.wikimedia.org/T285978) [16:28:38] (03PS1) 10Vgutierrez: admin: Update Aisha Khatun SSH key [puppet] - 10https://gerrit.wikimedia.org/r/704139 (https://phabricator.wikimedia.org/T286410) [16:32:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704139 (https://phabricator.wikimedia.org/T286410) (owner: 10Vgutierrez) [16:37:30] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10RobH) [16:41:17] (03CR) 10Vgutierrez: "pending out of band confirmation from AKhatun" [puppet] - 10https://gerrit.wikimedia.org/r/704139 (https://phabricator.wikimedia.org/T286410) (owner: 10Vgutierrez) [16:43:42] 10SRE, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10RobH) @cdanis, Is testing complete with this host? If so, should we reclaim to spares and/or decom it? 
[16:44:42] (03CR) 10Vgutierrez: [C: 03+2] "acked via email" [puppet] - 10https://gerrit.wikimedia.org/r/704139 (https://phabricator.wikimedia.org/T286410) (owner: 10Vgutierrez) [16:48:46] 10SRE, 10Traffic, 10serviceops, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10jijiki) [16:49:00] 10SRE, 10Traffic, 10serviceops, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10jijiki) [16:49:06] 10SRE, 10MW-on-K8s, 10serviceops, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10jijiki) [16:51:08] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:51:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting update to SSH key for Aisha Khatun - https://phabricator.wikimedia.org/T286410 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez SSH key updated after out of band confirmation via email. I've triggered a puppet run on bastion hosts. The rest of... [16:52:07] 10SRE, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10MoritzMuehlenhoff) These days we have sretest*, so should be good to reclaim. [16:52:21] * kormat perks up [16:53:10] ah hah. 
no objection here [16:54:50] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704140 [16:54:52] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704141 [16:54:57] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Inductiveload) @Legoktm some at enwikisource are pretty desperate to get this functionality back, so I imagi... [17:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T1700). [17:06:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [17:08:38] (03PS1) 10Brennen Bearnes: explicitly set ansible_python_interpreter to python3 [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/704143 [17:10:05] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704144 [17:10:07] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704145 [17:12:26] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1673 MB (6% inode=94%): /tmp 1673 MB (6% inode=94%): /var/tmp 1673 MB (6% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [17:16:23] ryankemper: FYI ^^^ [17:16:48] volans: thx [17:18:08] ACKNOWLEDGEMENT - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1673 MB (6% inode=94%): /tmp 1673 MB (6% 
inode=94%): /var/tmp 1673 MB (6% inode=94%): Ryan Kemper https://phabricator.wikimedia.org/T285643 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [17:18:08] ACKNOWLEDGEMENT - MD RAID on elastic1039 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 Ryan Kemper https://phabricator.wikimedia.org/T285643 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:23:53] (03PS1) 10Daimona Eaytoy: Avoid passing invalid offset to mb_strpos [extensions/AbuseFilter] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703904 (https://phabricator.wikimedia.org/T285978) [17:26:32] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: for - https://phabricator.wikimedia.org/T286497 (10RKemper) [17:28:46] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: for - https://phabricator.wikimedia.org/T286497 (10RKemper) [17:33:18] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [17:33:41] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10RKemper) [17:34:00] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) >>! In T257066#7206368, @Inductiveload wrote: > @Legoktm some at enwikisource are pretty desperate... 
[17:37:18] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 (10RKemper) See https://phabricator.wikimedia.org/T286497 for HW maint request [17:38:03] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) Kubelets / calico / bird are deployed on the ml-serve-ctrl nodes, but the istio webhook svc seems not reac... [17:41:14] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:48] (03CR) 10Aaron Schulz: [C: 03+1] [Beta Cluster] Disable wgEnableWANCacheReaper experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677731 (owner: 10Krinkle) [17:48:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) a:05Cmjohnson→03RobH [17:58:21] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) We are using some tools in plwikisource that require the raw mode. The tools are intended to choose a... [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T1800) [18:00:05] legoktm: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:42] hm, jouncebot needs a bit of tweaking for the new wording thcipriani ^ [18:01:01] hrm, so it seems [18:01:51] Maybe I should just move that to a new line and be lazy. [18:02:21] legoktm: i guess you'll self deploy, right? :) [18:02:27] oh, hrm, it is on a new line. Guess I'll have to look at code. [18:02:49] urbanecm: yep [18:05:34] (03PS3) 10Legoktm: Enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703790 (https://phabricator.wikimedia.org/T257066) [18:06:40] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Inductiveload) OK, I have proposed it: https://en.wikisource.org/wiki/Wikisource:Scriptorium#Proposal:_Volun... [18:09:17] (03PS1) 10Legoktm: Disable Score on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) [18:09:36] (03PS2) 10Legoktm: Uninstall Score on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) [18:09:40] (03CR) 10Legoktm: [C: 03+2] Enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703790 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [18:10:22] (03Merged) 10jenkins-bot: Enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703790 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [18:12:16] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable Score using Shellbox on testwiki (T257066) (duration: 00m 58s) [18:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:24] T257066: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 [18:20:19] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 
10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) >>! In T257066#7206576, @Ankry wrote: > We are using some tools in plwikisource that require the ra... [18:30:00] 10SRE, 10DBA, 10Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10RLazarus) a:03RLazarus Oh, that does sound better! I didn't realize `icinga-status` was out there, thanks for the pointer. Your plan s... [18:37:27] !log otto@deploy1002 Started deploy [analytics/refinery@200b502]: Finalize event_default gobblin job - T271232 [18:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:34] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [18:39:50] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) [18:40:02] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) [18:41:06] !log otto@deploy1002 Finished deploy [analytics/refinery@200b502]: Finalize event_default gobblin job - T271232 (duration: 03m 39s) [18:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:26] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Finalze gobblin event migration [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [18:46:15] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [18:46:34] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [18:49:05] (03PS1) 10Ottomata: Refine event - fix input_path_regex_capture_groups param [puppet] - 10https://gerrit.wikimedia.org/r/704154 
(https://phabricator.wikimedia.org/T271232) [18:50:29] (03CR) 10Ottomata: [C: 03+2] Refine event - fix input_path_regex_capture_groups param [puppet] - 10https://gerrit.wikimedia.org/r/704154 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [18:55:50] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Dwisehaupt) [18:56:40] legoktm: you're done, right? [18:56:44] (03PS3) 10Krinkle: [Beta Cluster] Disable wgEnableWANCacheReaper experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677731 [18:56:48] yep [18:56:52] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] Disable wgEnableWANCacheReaper experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677731 (owner: 10Krinkle) [18:56:58] 10SRE, 10DBA, 10Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Volans) >>! In T285803#7206669, @RLazarus wrote: > I'll start by refactoring the icinga module to write to the command file directly (f... [18:57:31] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Dwisehaupt) Still need to confirm the window with Advancement, but it is looking ok right now. There will be some work on the FR-Tech side to ensure d... [18:57:33] (03Merged) 10jenkins-bot: [Beta Cluster] Disable wgEnableWANCacheReaper experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677731 (owner: 10Krinkle) [18:57:44] legoktm: k :) btw, the warmup issue, shall we plan a time soon to do a run outside a switch over to see if we can reproduce the issue and gather intel? https://phabricator.wikimedia.org/T285802 [18:58:51] sure, we can do it whenever today or tomorrow [18:59:18] I can't, probably a bit later like next week. [18:59:36] or Thursday? 
[19:00:49] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Quiddity) >>! In T286371#7202536, @Ladsgroup wrote: > The underlying problem is much harder to fix, we didn't have standard concept of "... [19:07:21] Krinkle: I'm on vacation starting Weds, I'll be back next Weds [19:07:57] k [19:08:32] sent invite for next thu [19:10:13] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) Hard to do. It is a piece of LUA and requires proofreage namespaces. It would need not trivial portin... [19:10:41] 10SRE, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Iflorez) Hello I've joined the Product Analytics team with @mpopov as my manager. Hooorah! I believe that I am currently added to... [19:11:17] (03PS3) 10Majavah: kubeadm: Upgrade Calico to v3.18.4 [puppet] - 10https://gerrit.wikimedia.org/r/703061 (https://phabricator.wikimedia.org/T280342) [19:12:23] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) @ankry can you provide a link to an example of a page using these complex scores? 
[19:12:51] (03CR) 10Bstorm: "The main feedback I have here is that the API version upgrades should be something that k8s does correctly on `kubectl apply`, but it coul" [puppet] - 10https://gerrit.wikimedia.org/r/703061 (https://phabricator.wikimedia.org/T280342) (owner: 10Majavah) [19:22:12] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) https://pl.wikisource.org/wiki/Pro%C5%9Bba_dziewcz%C4%99cia Transcluded using ProofreadPage and LUA f... [19:22:35] 10SRE, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Majavah) >>! In T223496#7206847, @Iflorez wrote: > Hello I've joined the Product Analytics team with @mpopov as my manager. Hooorah... [19:29:05] 10SRE, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10mpopov) Created new request: T286509 [19:33:20] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) >>! In T257066#7206860, @Ankry wrote: > https://pl.wikisource.org/wiki/Strona:Pro%C5%9Bba_dziewcz%C... 
[19:43:46] (03CR) 10Bstorm: [C: 03+2] kubeadm: Upgrade Calico to v3.18.4 [puppet] - 10https://gerrit.wikimedia.org/r/703061 (https://phabricator.wikimedia.org/T280342) (owner: 10Majavah) [19:44:02] (03CR) 10Bstorm: [C: 03+2] "Let's give it a try in toolsbeta 😊" [puppet] - 10https://gerrit.wikimedia.org/r/703061 (https://phabricator.wikimedia.org/T280342) (owner: 10Majavah) [19:53:02] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) >>! In T257066#7206899, @Legoktm wrote: >>>! In T257066#7206860, @Ankry wrote: >> https://pl.wikisour... [19:53:34] (03PS4) 10H.krishna123: [WIP] api_db: Add code to enable database connection, load config from alerting yaml file [software/bernard] - 10https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) [20:00:05] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T2000). 
[20:10:08] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [20:11:44] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [20:18:05] (03PS1) 10RobH: backup100[4567] setup params [puppet] - 10https://gerrit.wikimedia.org/r/704158 (https://phabricator.wikimedia.org/T277327) [20:18:32] (03CR) 10jerkins-bot: [V: 04-1] backup100[4567] setup params [puppet] - 10https://gerrit.wikimedia.org/r/704158 (https://phabricator.wikimedia.org/T277327) (owner: 10RobH) [20:19:55] (03PS2) 10RobH: backup100[4567] setup params [puppet] - 10https://gerrit.wikimedia.org/r/704158 (https://phabricator.wikimedia.org/T277327) [20:21:28] (03CR) 10RobH: [C: 03+2] backup100[4567] setup params [puppet] - 10https://gerrit.wikimedia.org/r/704158 (https://phabricator.wikimedia.org/T277327) (owner: 10RobH) [20:23:40] (03CR) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [20:23:43] Bit of a strange date format I'm seeing on planet.wikimedia: "22:05, Sunday, 04 2020 October UTC" [20:24:27] My guess is that some kind of date format string ended up interpreted by different software as intended since that order and selection of information seems very... odd. [20:24:35] mutante: ^ [20:25:11] lol my brain absolutely hates that [20:25:13] (03PS1) 10Ottomata: Add gobbln job eventlogging_legacy [puppet] - 10https://gerrit.wikimedia.org/r/704159 (https://phabricator.wikimedia.org/T271232) [20:25:15] its screaming. 
[20:25:36] im going to refrain linking the xkcd iso date comic someone else can do it ;D [20:27:40] robh: sure, https://xkcd.com/1179/ [20:27:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['backup1004.eqiad.wmnet', 'backup1005.eqiad.wmnet'... [20:34:51] (03PS1) 10Ottomata: Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) [20:35:24] (03CR) 10jerkins-bot: [V: 04-1] Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [20:36:39] (03PS2) 10Ottomata: Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) [20:39:27] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) >>! In T257066#7206936, @Ankry wrote: >>>! In T257066#7206899, @Legoktm wrote: >>>>! In T257066#72068... [20:43:46] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:57:36] jouncebot: now [20:57:36] For the next 0 hour(s) and 2 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T2000) [21:00:05] Reedy and sbassett: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T2100). 
[21:10:50] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10jeena) a:05jeena→03None [21:15:48] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) These failed for not liking the specified partition recipie, which was set by someone else, so I need to investigate whats up. [21:26:32] !log Start server-side upload for 2 video files (T286432, T286433) [21:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:40] T286432: Server side upload for 고려 - https://phabricator.wikimedia.org/T286432 [21:26:41] T286433: Server side upload for 고려 - https://phabricator.wikimedia.org/T286433 [21:28:20] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) >>! In T286065#7205088, @cmooney wrote: > @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently inc... [21:31:19] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Bstorm) @cmooney Do the cloudsw switches get impacted by row B updates? [21:55:40] 10ops-eqiad: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) So I just happened to notice this, but in the future, please file requests [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ | using the form ]], as it outlines what has to happen. One of those things is assi... 
[21:57:36] 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) [21:58:07] 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) a:03RobH [22:12:22] PROBLEM - Host mw2383 is DOWN: PING CRITICAL - Packet loss = 100% [22:17:06] RECOVERY - Host mw2383 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [22:34:07] 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) a:05RobH→03jijiki Effie, I updated the firmware on this to the latest version, it hadn't been updated since it's purchase and was a couple of revisions out of date. It is now sitting back ready to be placed into service... [22:40:57] (03PS2) 10Bstorm: cloud nfs: cleaning up the non-drbd setup [puppet] - 10https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) [22:43:07] (03CR) 10Bstorm: [C: 03+2] "I'm going to try to set this up. If it works, it'll be one less thing that has to worry about toolsdb maintenance 😊" [puppet] - 10https://gerrit.wikimedia.org/r/699471 (https://phabricator.wikimedia.org/T267683) (owner: 10Bstorm) [22:46:58] PROBLEM - Disk space on urldownloader2002 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=85%): /tmp 340 MB (3% inode=85%): /var/tmp 340 MB (3% inode=85%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader2002&var-datasource=codfw+prometheus/ops [22:50:36] o.o [22:52:17] /var/spool/squid/netdb.state is 2.5G [22:53:45] !log root@urldownloader2002:/var/cache/apt# rm -rf * to free up space [22:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:07] (03PS1) 10Zabe: Add 'editautoreviewprotected' to bot on hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704191 (https://phabricator.wikimedia.org/T275076) [22:59:36] (03CR) 10Bstorm: "PCC for the servers:" [puppet] - 
10https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210712T2300). [23:00:04] dontpanic and zabe: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] I can deploy today [23:00:13] o/ [23:00:17] dontpanic: zabe: hello! [23:00:18] 10SRE: urldownloader2002 running out of disk space in root partition - https://phabricator.wikimedia.org/T286525 (10Legoktm) [23:00:50] hi [23:01:11] hey urbanecm :D [23:01:54] (03PS2) 10Urbanecm: zhwiktionary: Add Reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703480 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:02:05] (03PS3) 10Urbanecm: zhwiktionary: Add Reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703480 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:02:10] (03CR) 10Urbanecm: [C: 03+2] zhwiktionary: Add Reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703480 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:02:33] (03PS2) 10Urbanecm: zhwiktionary: Add aliases for namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703481 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:02:53] (03Merged) 10jenkins-bot: zhwiktionary: Add Reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703480 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:03:13] dontpanic: first patch is 
available at mwdebug2001, can you check? [23:03:24] (03CR) 10Urbanecm: [C: 03+2] zhwiktionary: Add aliases for namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703481 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:03:36] (03PS2) 10Urbanecm: zhwiktionary: Add templateeditor right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703482 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:03:42] sure [23:04:11] (03Merged) 10jenkins-bot: zhwiktionary: Add aliases for namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703481 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:05:13] urbanecm: I don't see it on Special:PrefixIndex [23:05:50] dontpanic: i do [23:06:02] check the debug server -- it's mwdebug2001 [23:06:06] oh, now I see it [23:06:11] think it was cache issues [23:06:18] also a possibility [23:06:19] anyway, syncing [23:06:58] (03CR) 10Urbanecm: [C: 03+2] zhwiktionary: Add templateeditor right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703482 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:07:37] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ba0967f5c18652d02b7b476e9592b81dcb9b74fc: zhwiktionary: Add Reconstruction namespace (T286101) (duration: 00m 57s) [23:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:44] T286101: zhwiktionary namespace aliases and more - https://phabricator.wikimedia.org/T286101 [23:07:45] (03Merged) 10jenkins-bot: zhwiktionary: Add templateeditor right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703482 (https://phabricator.wikimedia.org/T286101) (owner: 10Tks4Fish) [23:07:50] dontpanic: should be live [23:07:52] RECOVERY - Disk space on urldownloader2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader2002&var-datasource=codfw+prometheus/ops [23:08:28] yes, looks good urbanecm :) [23:08:39] dontpanic: good :). Aliases pulled to mwdebug2001 [23:10:54] all there :) [23:11:07] great, syncing [23:12:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5822b2be129b934939af46bab5b8916039661e97: zhwiktionary: Add aliases for namespaces (T286101) (duration: 00m 57s) [23:12:36] should be live [23:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:49] dontpanic: templateeditor pulled to mwdebug2001 [23:13:26] yep, it's there :) [23:13:57] cool [23:13:58] syncing [23:15:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5ab00d188bc4161e40455b842f613698548b3518: zhwiktionary: Add templateeditor right (T286101) (duration: 00m 57s) [23:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:56] T286101: zhwiktionary namespace aliases and more - https://phabricator.wikimedia.org/T286101 [23:15:57] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=zhwiktionary --fix --add-prefix=BROKEN # T286101, P16817 [23:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:03] dontpanic: and live :) [23:16:11] thanks for the patches [23:16:20] everything's there (aliases and rights), thanks for deploying! 
[23:16:25] (03PS6) 10Urbanecm: Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [23:16:37] (03CR) 10Urbanecm: [C: 03+2] Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [23:16:47] any time dontpanic :) [23:17:34] (03Merged) 10jenkins-bot: Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [23:18:49] Hello, I would like to make a patch for https://phabricator.wikimedia.org/T286396 anyways. Is there any chance to have it deployed in this SWAT window? [23:18:59] *backport window [23:19:01] zabe: pulled to mwdebug2001, please have a look [23:19:18] Kizule: if you're able to upload a patch, sure [23:19:37] Thanks urbanecm, I'm going to do it now. [23:21:24] zabe: how's your test going? 
🙂 [23:21:55] please give me another minute :) [23:22:04] sure, take your time :) [23:22:40] (03PS2) 10Urbanecm: Add 'editautoreviewprotected' to bot on hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704191 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [23:22:44] (03CR) 10Urbanecm: [C: 03+2] Add 'editautoreviewprotected' to bot on hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704191 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [23:23:25] urbanecm: works, I forgot that wgRelatedArticlesFooterAllowedSkins doesn't contain vector [23:23:27] (03Merged) 10jenkins-bot: Add 'editautoreviewprotected' to bot on hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704191 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [23:23:35] great, syncing :) [23:23:47] (03PS3) 10Zoranzoki21: Add few namespace aliases for Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704169 (https://phabricator.wikimedia.org/T286396) [23:24:59] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 40eade4131eac95ba3dc0d918ad540070d7bcb99: Enable RelatedArticles Extension in zhwikinews (T266933) (duration: 00m 57s) [23:25:04] zabe: live :) [23:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:06] T266933: Enable RelatedArticles Extension in zhwikinews - https://phabricator.wikimedia.org/T266933 [23:25:23] zabe: the hewikisource patch is at mwdebug2001 now [23:26:08] urbanecm: works the supposed way [23:26:13] excellent, syncing [23:26:50] (03PS3) 10Urbanecm: Remove unused celebration logos and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704115 (https://phabricator.wikimedia.org/T286380) (owner: 10Zabe) [23:26:55] (03CR) 10Urbanecm: [C: 03+2] Remove unused celebration logos and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704115 (https://phabricator.wikimedia.org/T286380) (owner: 10Zabe) 
[23:27:08] zabe: i'll deploy the last patch w/o a test, as there's nothing to test anyway [23:27:25] (I won't purge the caches though, in case a client has the old URLs cached) [23:27:35] (03Merged) 10jenkins-bot: Remove unused celebration logos and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704115 (https://phabricator.wikimedia.org/T286380) (owner: 10Zabe) [23:27:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6c581493fbe5d9c372fd44635b704d04040d8b38: Add editautoreviewprotected to bot on hewikisource (T275076) (duration: 00m 57s) [23:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:48] T275076: Adding protection levels in hewikisource - https://phabricator.wikimedia.org/T275076 [23:28:47] (03PS2) 10Urbanecm: enwiki: Delete Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703601 (https://phabricator.wikimedia.org/T285766) [23:28:52] (03CR) 10Urbanecm: [C: 03+2] enwiki: Delete Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703601 (https://phabricator.wikimedia.org/T285766) (owner: 10Urbanecm) [23:29:12] !log urbanecm@deploy1002 Synchronized static/images/: d007b9ccb77db9f3dc492df7a35477e5563a921a: Remove unused celebration logos and wordmark (T286380) (duration: 00m 57s) [23:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:20] T286380: Revert ptwiki logo temporarily changed - https://phabricator.wikimedia.org/T286380 [23:29:26] zabe: all should be live [23:29:37] thanks for your help :) [23:29:39] I'm ready urbanecm. :) [23:29:51] (03Merged) 10jenkins-bot: enwiki: Delete Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703601 (https://phabricator.wikimedia.org/T285766) (owner: 10Urbanecm) [23:30:03] Kizule: please add your patch to the calendar -- thanks! [23:30:13] Okay, I will add it now. 
Meanwhile: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704169 [23:31:31] urbanecm: It's done. [23:32:38] thanks [23:32:44] will get to you once my patch is synced [23:32:51] urbanecm: No problem [23:33:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8a79bf752ff5eb15f3042fd94ba10c2c50607a85: enwiki: Delete Book namespace (T285766) (duration: 00m 57s) [23:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:37] T285766: Remove the Book namespace from enwiki - https://phabricator.wikimedia.org/T285766 [23:34:03] (03PS4) 10Urbanecm: Add few namespace aliases for Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704169 (https://phabricator.wikimedia.org/T286396) (owner: 10Zoranzoki21) [23:34:12] (03CR) 10Urbanecm: [C: 03+2] Add few namespace aliases for Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704169 (https://phabricator.wikimedia.org/T286396) (owner: 10Zoranzoki21) [23:35:05] (03Merged) 10jenkins-bot: Add few namespace aliases for Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704169 (https://phabricator.wikimedia.org/T286396) (owner: 10Zoranzoki21) [23:35:33] Kizule: pulled to mwdebug2001, can you have a look? [23:35:52] urbanecm: Sure, but you will need to run namespaceDupes.php. [23:36:04] it should be testable w/o that script though [23:36:23] stuff like sr.wikipedia.org/wiki/WP:A needs to redirect to the actual NS [23:36:26] which it does... [23:36:28] ...so syncing [23:37:23] urbanecm: Correct [23:37:32] I tested, it is good. 
[23:38:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 284216a7d35c815ea203a9c0bd738a1e1bf31f7e: Add few namespace aliases for Serbian Wikipedia (T286396) (duration: 00m 56s) [23:38:31] Kizule: live [23:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:34] T286396: Add ВП and WP aliases on srwiki - https://phabricator.wikimedia.org/T286396 [23:38:35] runnig that script [23:38:54] urbanecm: Okay, I see. [23:39:26] and... [23:39:30] ...that script fatalerrors [23:39:30] meh [23:39:47] Fatal error? [23:39:56] yup, this one https://www.irccloud.com/pastebin/0oR4VT4u/ [23:40:06] * urbanecm checking what's going on... [23:41:30] If something is wrong with page, you can delete it. [23:41:49] running the script the second time made it finish without issues [23:42:53] Everything should be okay on wiki. [23:45:45] !log urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=srwiki --fix --add-prefix=BROKEN # T286396 [23:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:52] T286396: Add ВП and WP aliases on srwiki - https://phabricator.wikimedia.org/T286396 [23:46:02] Kizule: please check the pages prefixed with BROKEN and delete as appropriate [23:46:40] urbanecm: I can't find them. [23:46:59] Are they prefixed like WP:BROKEN or BROKEN:WP? [23:47:19] namespace:BROKENxxx, where xxx is the namespace ID [23:47:23] ie. they're in the proper namespace [23:47:26] and prefixed with BROKEN [23:47:30] (03CR) 10Juan90264: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:47:38] this is an example https://www.irccloud.com/pastebin/zccKTn4n/ [23:47:46] (03CR) 10Juan90264: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:49:22] Urbanecm: I can't find it. [23:49:46] Kizule: https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B5%D0%A1%D0%B0%D0%9F%D1%80%D0%B5%D1%84%D0%B8%D0%BA%D1%81%D0%BE%D0%BC&namespace=4&prefix=BROKEN [23:50:43] !log Delete Project:BROKENPesak at sr.wikipedia to be able to rerun namespaceDupes.php (T286396) [23:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:58] !log urbanecm@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=srwiki --fix --add-prefix=BROKEN # T286396 [23:51:00] urbanecm: I've deleted three pages which were listed there. [23:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:04] thanks [23:51:04] T286396: Add ВП and WP aliases on srwiki - https://phabricator.wikimedia.org/T286396 [23:51:18] please delete the other four :) [23:51:22] (at the same list) [23:51:57] !log urbanecm@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=srwiki --fix --add-prefix=T286396 # T286396 [23:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:05] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:52:14] Kizule: and pages at https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B5%D0%A1%D0%B0%D0%9F%D1%80%D0%B5%D1%84%D0%B8%D0%BA%D1%81%D0%BE%D0%BC&namespace=4&prefix=T286396, too :) [23:52:45] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:53:10] urbanecm: Will do now. 
[23:53:52] (03CR) 10jerkins-bot: [V: 04-1] Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:54:29] (03PS2) 10Urbanecm: Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703600 (https://phabricator.wikimedia.org/T286163) [23:54:34] (03CR) 10Urbanecm: [C: 03+2] Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703600 (https://phabricator.wikimedia.org/T286163) (owner: 10Urbanecm) [23:55:19] (03Merged) 10jenkins-bot: Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703600 (https://phabricator.wikimedia.org/T286163) (owner: 10Urbanecm) [23:55:57] (03PS4) 10Zabe: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:57:08] (03CR) 10jerkins-bot: [V: 04-1] Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [23:57:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1896efc27f3de39659673091bc4c43ad874da0c5: Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T286163) (duration: 00m 56s) [23:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:21] T286163: Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T286163 [23:57:22] * urbanecm done