[00:45:36] <icinga-wm>	 PROBLEM - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:16:36] <wikibugs>	 10Data-Engineering, 10SRE: Allow kafka brokers to reload the TLS keystore - https://phabricator.wikimedia.org/T299409 (10elukey) Tried to reload the keystore on a couple of test brokers since the first warnings for tls cert expiry came up in icinga, but it doesn't seem to work. On the server.log I see stuff li...
[10:29:24] <wikibugs>	 (03PS1) 10DCausse: rdf_streaming_updater: add a reconcile event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/756536 (https://phabricator.wikimedia.org/T279541)
[10:49:28] <wikibugs>	 (03Abandoned) 10DCausse: rdf_streaming_updater: add a reconcile event schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/740109 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse)
[11:31:11] <wikibugs>	 10Data-Engineering-Kanban, 10Data-Catalog: [[wikitech:Data Catalog Application Evaluation Rubric]] links to some non-public Google Doc "execution plan" - https://phabricator.wikimedia.org/T299900 (10EChetty)
[11:48:49] <elukey>	 !log roll restart of kafka test brokers to pick up the new keystore/tls-certs (1y of validity)
[11:48:51] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:00:43] <elukey>	 notBefore=Jan 24 11:32:00 2022 GMT
[12:00:43] <elukey>	 notAfter=Jan 24 11:32:00 2023 GMT
[12:00:48] <elukey>	 this is from kafka-test1008 :)
[12:01:03] <elukey>	 dynamic reload seems not working for kafka 1.x (at least not they way we'd like)
[12:01:22] <elukey>	 it works well in kafka 2.1+, mayyybeeee we could upgrade in the future?? :)
[12:13:22] <wikibugs>	 10Data-Engineering, 10SRE: Allow kafka brokers to reload the TLS keystore - https://phabricator.wikimedia.org/T299409 (10elukey) 05Open→03Resolved a:03elukey It seems that our kafka version, 1.1, doesn't support well this use case. The kafka intermediate PKI CA now issues cert with 1y of validity, to red...
[12:13:29] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10SRE, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey)
[12:29:59] <joal>	 we surely will elukey - the question is when ;)
[12:32:10] <elukey>	 I am going to open a task tracking good things 
[12:32:25] <elukey>	 for the moment I see TLSv1.3, Java 11 and the dynamic reload
[12:54:45] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (10JAllemandou) Hi @Aklapper  - AFAIK the wikistats code is not available anywhere else than archive, so there is no "deploy" way for...
[13:05:24] <wikibugs>	 10Data-Engineering, 10Data-Catalog, 10Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (10EChetty)
[13:05:47] <wikibugs>	 10Data-Engineering, 10Data-Catalog: Data Catalog Feature Matrix [Mile Stone 1] - https://phabricator.wikimedia.org/T299887 (10EChetty)
[13:05:56] <wikibugs>	 10Data-Engineering, 10Data-Catalog: Data Catalog Deployment Plan [Mile Stone 2] - https://phabricator.wikimedia.org/T299888 (10EChetty)
[13:06:08] <wikibugs>	 10Data-Engineering, 10Data-Catalog: Data Catalog Initial Deployment. [Mile Stone 3] - https://phabricator.wikimedia.org/T299893 (10EChetty)
[13:06:14] <wikibugs>	 10Data-Engineering, 10Data-Catalog: Connect MVP to a Data Source [Mile Stone 4] - https://phabricator.wikimedia.org/T299897 (10EChetty)
[13:06:21] <wikibugs>	 10Data-Engineering, 10Data-Catalog: Connect remaining Data Sources to the MVP [Mile Stone 5] - https://phabricator.wikimedia.org/T299899 (10EChetty)
[13:55:49] <wikibugs>	 10Data-Engineering, 10Metrics-Platform, 10SRE, 10Traffic, 10Patch-For-Review: VarnishKafka to propagate user agent client hints headers to webrequest - https://phabricator.wikimedia.org/T299401 (10EChetty)
[13:56:16] <wikibugs>	 10Data-Engineering, 10Metrics-Platform: Add user agent client hints to the  `webrequest` table - https://phabricator.wikimedia.org/T299402 (10EChetty)
[13:56:50] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Metrics-Platform, 10Privacy Engineering, and 3 others: Measure user-agent client hints already sent in browsers requests - https://phabricator.wikimedia.org/T299397 (10EChetty)
[14:15:55] <joal>	 ottomata: heya - let's try to find some time later in the day to talk about sparkSQL - I'll need a brainbounce as well
[14:20:05] <ottomata>	 okay!
[14:20:15] <ottomata>	 joal gimme 10 mins and now is probably good?
[14:21:28] <ottomata>	 maybe mforns  can join too?!
[14:24:36] <wikibugs>	 (03PS1) 10Aqu: Migrate AQS/hourly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/756601 (https://phabricator.wikimedia.org/T299398)
[14:26:11] <wikibugs>	 (03PS2) 10Aqu: Migrate AQS/hourly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/756601 (https://phabricator.wikimedia.org/T299398)
[14:28:18] <wikibugs>	 (03PS3) 10Aqu: [WIP] Migrate AQS/hourly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/756601 (https://phabricator.wikimedia.org/T299398)
[14:33:23] <ottomata>	 joal:  mforns wanna chat?
[14:41:09] <mforns>	 heya ottomata yesss
[14:42:03] <mforns>	 bc?
[14:43:17] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (10elukey) I think that we are good, perl scripts of the old wikistats 1.x are not used/run anymore, and we deploy wikistats 2.x via p...
[14:43:34] <ottomata>	 mforns:  ya coming
[14:43:40] <mforns>	 k
[14:54:20] <ottomata>	 joal:  got meeting but lets also talk skein kerberos stuff today
[14:54:25] <ottomata>	 https://blog.cloudera.com/resource-localization-in-yarn-deep-dive/ is relevant
[14:54:30] <ottomata>	 but, not sufre
[14:57:16] <elukey>	 razzi: o/ about https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/753425 - are you still planning to do it? Otherwise I can take care of it during the next days, lemme know :)
[15:19:19] <wikibugs>	 10Data-Engineering, 10Metrics-Platform: Add user agent client hints to the  `webrequest` table - https://phabricator.wikimedia.org/T299402 (10EChetty) a:03JAllemandou
[16:19:03] <joal>	 ottomata: wanna chat before next meeting?
[16:39:28] <ottomata>	 ah sorry joal!  i guess after?
[16:39:42] <joal>	 yup :)
[16:59:07] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Discovery-Search, 10Product-Analytics, and 3 others: Write an Airflow job converting commons structured data dump to Hive - https://phabricator.wikimedia.org/T299059 (10Gehel)
[17:09:50] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Troubleshoot MySQL connection issues - https://phabricator.wikimedia.org/T298893 (10odimitrijevic)
[17:14:02] <mforns>	 ottomata: this might allow us to delete headers from the browser JS no?
[17:14:05] <mforns>	 https://developer.mozilla.org/en-US/docs/Web/API/Headers
[17:15:37] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering-Kanban, 10wmfdata-python, 10Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (10nshahquinn-wmf) a:05Milimetric→03nshahquinn-wmf
[17:20:05] <ottomata>	 interesting mforns, maybe!  could work! 
[17:22:40] <ottomata>	 although, a pro of not doing it in code is that you don't have to do it every producer client
[17:22:48] <ottomata>	 e.g. MW server side i believe relies on this sometimes
[17:23:44] <ottomata>	 if metrics platform really could  provide a unified code interface for all instrumentation producers, and had implementation for them all, e.g. browse JS, MW PHP, android Java, etc. then it might be easier to just say 'let the code do it'
[17:33:07] <wikibugs>	 10Analytics-Radar, 10Product-Analytics, 10SDC General, 10Wikidata: Data about how many file pages on Commons contain at least one structured data element - https://phabricator.wikimedia.org/T238878 (10nettrom_WMF) 05Open→03Resolved a:03nettrom_WMF It's my understanding that this has been automated in...
[17:34:24] <wikibugs>	 10Analytics, 10Data-Engineering, 10Product-Analytics: AQS `edited-pages/new` metric does not make clear that the value is net of deletions - https://phabricator.wikimedia.org/T240860 (10nshahquinn-wmf) This is still an issue; the endpoint name and documentation do not accurately explain what the metric is.
[17:35:53] <wikibugs>	 10Data-Engineering, 10Stewards-and-global-tools: Collect information about users affected by blocks - https://phabricator.wikimedia.org/T297051 (10Ottomata) Not what you need, but just to be sure you are aware, there is a `event.mediawiki_user_blocks_change` Hive table that comes from the `mediawiki.user-block...
[17:44:39] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Metrics-Platform, 10Privacy Engineering, and 3 others: Measure user-agent client hints already sent in browsers requests - https://phabricator.wikimedia.org/T299397 (10CBogen) Hi @JAllemandou, is there an ask for the Structured Data team here or should we just track...
[17:47:30] <joal>	 hey mforns - would you have a minute for Sargento and myself? we have a question for you :)
[17:48:15] <Sargento>	 ? 
[17:48:27] <joal>	 woops- wrong ping excuse me Sargento -
[17:48:34] <mforns>	 joal: yes!
[17:49:00] <mforns>	 batcave?
[17:49:11] <joal>	 we're here mforns: https://meet.google.com/nby-kkhp-wsc
[17:51:20] <wikibugs>	 10Analytics, 10Metrics-Platform, 10Product-Analytics: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10Mayakp.wiki)
[17:53:28] <joal>	 ottomata: heya - would you be around?
[17:53:41] <mforns>	 ottomata: can you come to the bc 5 mins? we're discussing Airflow connections
[17:53:47] <mforns>	 oops, ok
[17:55:04] <btullis>	 razzi: are you working on mysql on an-test-coord1001 at all?
[17:55:05] <wikibugs>	 10Analytics, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10nshahquinn-wmf) a:03Mayakp.wiki
[17:55:25] <razzi>	 I'm not working on mysql, what do you mean btullis ?
[17:55:31] <razzi>	 Cool owl logo btw
[17:55:55] <razzi>	 Or I should say, why do you ask?
[17:55:58] <btullis>	 https://usercontent.irccloud-cdn.com/file/LC1BW2Z4/image.png
[17:56:11] <btullis>	 From #wikimedia-data-persistence
[17:56:33] <btullis>	 https://www.irccloud.com/pastebin/csU6DWFg/
[17:57:31] <milimetric>	 btullis: I'm building mysql 8 but I tried to isolate the install to a separate folder and I haven't even run make install yet, so I doubt it's me
[17:57:46] <btullis>	 I just wondered, because I know that we've been trying  a build of mysql 8.0 on this server, but I can't see any reason why the system provided one would misbehave.
[17:58:29] <razzi>	 huh yeah that's odd
[17:58:35] <btullis>	 milimetric: Yeah, that's what I was thinking. Your build should be totally isolated. 
[17:58:53] <razzi>	 I did resize the /srv partition, it's possible that affected the running mysql
[17:59:19] <razzi>	 but the /run path is not on /srv
[17:59:30] <wikibugs>	 10Analytics, 10Product-Analytics, 10Epic: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (10nshahquinn-wmf)
[17:59:39] <btullis>	 Oh yeah, I think that's probably likely. `dmesg` is full of ext4 errors.
[17:59:47] <wikibugs>	 10Analytics, 10Product-Analytics, 10Epic: Add wikidata ids to data lake tables - https://phabricator.wikimedia.org/T221890 (10nshahquinn-wmf) 05Stalled→03Resolved >>! In T221890#7117063, @JAllemandou wrote: > Actually this table is now production-style on the cluster, at path `hdfs:///wmf/data/wmf/wikida...
[18:00:50] <btullis>	 https://www.irccloud.com/pastebin/hnJCTcjs/
[18:00:58] <ottomata>	 mforns:  joal  eating lunch, just jumped in bc
[18:01:27] <mforns>	 ottomata: we're here: https://meet.google.com/nby-kkhp-wsc
[18:01:46] <razzi>	 hm ok btullis let me restart mysql
[18:02:30] <razzi>	 but if the data is not in a good shape, we might have to reset the database itself
[18:03:21] <btullis>	 Yes, we might have to force an fsck and reboot.
[18:03:21] <btullis>	 Did you keep a record of what commands you ran to do the resizing?
[18:04:02] <razzi>	 Yeah btullis see https://wikitech.wikimedia.org/wiki/User:Razzi/First_logical_volume_resizing
[18:04:14] <elukey>	 razzi: not a big problem, it happens, but don't forget to !log things in here about things done so people are aware (test nodes are still production)
[18:04:27] <razzi>	 Ah yeah I should have done the !log elukey 
[18:04:31] <razzi>	 ty for the reminder
[18:07:32] <razzi>	 so: `systemctl restart mysqld` is hanging
[18:11:22] <btullis>	 OK. In future, you have to make sure that you reduce the size of the file system *first*, then reduce the logical volume, then you can resize the file system to fill all of the available space.
[18:11:54] <elukey>	 also, don't do it in production unless reviewed by others first :)
[18:12:41] <btullis>	 That's what I've always done, anyway. I'd rather do it when a server is in single user mode with no other processes running.
[18:13:16] <razzi>	 Yeah btullis that makes sense
[18:13:49] <btullis>	 In this case you might want to do `sudo touch /forcefsck` and reboot with a console, so that you can watch the fsck process and respond if necessary.
[18:13:54] <btullis>	 elukey: Would you agree?
[18:13:58] <ottomata>	 ^^ lvm resising usually works pretty seamlessly if done right, i wouldn't bother with single user mode, but i def would try and make sure nothing is running that is using the volume
[18:14:30] <ottomata>	 but single user mode would be even safer for sure
[18:14:39] <btullis>	 Sorry, I meant `sudo touch /srv/forcefsck`
[18:15:04] <razzi>	 hm, is that like a magic file that runs fsck?
[18:15:33] <btullis>	 OK, good to know. I've no issue with growing it while it's in use, but I've had incidents shrinking it whilst in use.
[18:15:34] <razzi>	 are you available btullis to troubleshoot some of this stuff realtime with me? I'm a bit over my head
[18:16:22] <btullis>	 I'm sort of cooking and doing childcare, so a bit multitasking.
[18:16:46] <razzi>	 sure thing, since it's the test cluster it's not particularly urgent
[18:17:12] <razzi>	 thanks for your help so far. Sorry for the mess...
[18:18:58] <btullis>	 No worries. It'll be fine. By the looks of `/var/log/syslog` it might just be one block, so an fsck might be OK.
[18:19:52] <razzi>	 so: unmount /srv, and run sudo fsck /dev/mapper-vg0-srv ?
[18:20:15] <razzi>	 it might cause a bunch of alerts, so maybe I should downtime the whole host?
[18:21:19] <btullis>	 Yep, that should do it. The `/forcefsck` is a useful way to do it for the root file system at next boot. It's just picked up by the initramfs, and scanned before boot. But your way is fine for a running server where we can unmount /srv/
[18:21:36] <btullis>	 Yeah, good idea.
[18:23:03] <razzi>	 !log downtime an-coord1001 while attempting to fix /srv partition
[18:23:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:24:05] <razzi>	 since the /srv partition is being accessed, I suppose I could either
[18:24:05] <razzi>	 - stop all the processes that are using it, or
[18:24:05] <razzi>	 - reboot into single user mode
[18:27:24] <ottomata>	 btullis: hm, ya maybe i'm remembering wrong and what i am remembering is just growing it, not shrinking it
[18:30:37] <razzi>	 I'm going to see if I can stop all the processes that are active on srv, starting with mysqld
[18:31:32] <razzi>	 are either of ottomata / elukey available to work together on this?
[18:32:15] <razzi>	 I guess I should start by stopping the processes that access mysql
[18:32:38] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Metrics-Platform, 10Privacy Engineering, and 3 others: Measure user-agent client hints already sent in browsers requests - https://phabricator.wikimedia.org/T299397 (10JAllemandou) >>! In T299397#7645734, @CBogen wrote: > Hi @JAllemandou, is there an ask for the Str...
[18:33:33] <ottomata>	 razzi: are you working on an-coord or an-test-coord1001?
[18:33:41] <razzi>	 an-test-coord, thankfully
[18:34:04] <ottomata>	 phewf!  you logged otherwise, you can edit the SAL wiki page after the fact
[18:34:16] <ottomata>	 https://issues.apache.org/jira/browse/AIRFLOW-2697
[18:34:19] <ottomata>	 oops
[18:34:22] <ottomata>	 v
[18:34:23] <ottomata>	 https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:34:39] <ottomata>	 razzi:  yes you can stop all processes
[18:34:41] <ottomata>	 on test-coord
[18:34:55] <ottomata>	 after you do that, you can lsof to make sure there isn't anything open on /srv
[18:34:59] <ottomata>	 if that is ok, then you can unmount it
[18:35:03] <ottomata>	 then you can do whatever you need to do to it
[18:35:04] <razzi>	 ah yeah I did
[18:35:17] <elukey>	 as far as I can see from https://wikitech.wikimedia.org/wiki/User:Razzi/First_logical_volume_resizing root was also resize right?
[18:35:31] <razzi>	 yeah I sized down /srv and sized up /
[18:36:36] <ottomata>	 sudo lsof /srv
[18:36:47] <wikibugs>	 10Analytics, 10Product-Analytics: Investigate easier methods for WMF staff to access Superset - https://phabricator.wikimedia.org/T258962 (10nshahquinn-wmf)
[18:37:12] <elukey>	 so if root has been resize while in use, it may have problems as well, and we'll need to run fsck on it as well
[18:37:38] <razzi>	 got it
[18:38:14] <elukey>	 so yeah as suggested before, I'd try to stop all processes using /srv/, umount it and fsck
[18:38:36] <elukey>	 then the /forcefsck suggestion from Ben + reboot (following via mgmt console)
[18:39:20] <elukey>	 need to run now :)
[18:40:13] <razzi>	 ok thanks for your help elukey!
[18:40:33] <ottomata>	 razzi:  i'm here, i haven't been totally following but can help as needed :)
[18:41:13] <btullis>	 Sorry, I'm right in the middle of feeding the kids. I can come back later to help as well if needed.
[18:42:07] <razzi>	 ottomata: think you could hop on a call for a minute to help me get my bearings?
[18:42:43] <razzi>	 I see about a dozen process accesing /srv and I'm not sure how to map those pids to systemctl units
[18:44:41] <ottomata>	 okay
[18:44:43] <wikibugs>	 10Analytics, 10Data-Engineering, 10Product-Analytics: Investigate easier methods for WMF staff to access Superset - https://phabricator.wikimedia.org/T258962 (10nshahquinn-wmf) The difficulty of accessing Superset is still an issue (despite significant and valuable work by Data Engineering and SRE to make it...
[18:45:19] <ottomata>	 razzi:  i'm in a huddle in the data-engineering-team slack
[18:49:47] <wikibugs>	 10Analytics-Radar, 10Data-Engineering-Radar, 10MediaViewer: Mediaviewer preloads should be marked as such via x-analytics tag - https://phabricator.wikimedia.org/T239655 (10nshahquinn-wmf)
[18:51:28] <wikibugs>	 10Analytics-Radar: Superset getting slower as usage increases - https://phabricator.wikimedia.org/T239130 (10nshahquinn-wmf) 05Open→03Declined A lot has changed with Superset in the last two years. In general, the speed of Superset itself (as opposed to the speed of certain queries) is not an issue.
[18:54:31] <wikibugs>	 10Analytics-Radar, 10Analytics-Wikistats: Feedback on Wikistats 2 new edits pages - https://phabricator.wikimedia.org/T210306 (10nshahquinn-wmf) 05Open→03Resolved Almost all of these issues have been resolved; the only exception has a separate task (T210423).
[18:55:49] <wikibugs>	 10Analytics-Radar, 10Product-Analytics: Migrate all reportupdater queries to hive - https://phabricator.wikimedia.org/T205296 (10nshahquinn-wmf)
[18:57:43] <wikibugs>	 10Analytics-Radar, 10Research ideas: Are watchlists dead? - https://phabricator.wikimedia.org/T166339 (10nshahquinn-wmf)
[19:02:42] <mforns>	 joal: you're coming to the superset date filters meeting?
[19:02:42] <razzi>	 !log razzi@an-test-coord1001:~$ sudo systemctl stop presto-server
[19:02:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:05:00] <joal>	 mforns: yes sorry, I'm late
[19:19:31] <ottomata>	 !log  kill mysqld on an-test-coord1001 - 19:19:04 [@an-test-coord1001:/etc] $ sudo kill 42433
[19:19:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:20:07] <ottomata>	 log  kill -9 mysqld on an-test-coord1001 - 19:19:04 [@an-test-coord1001:/etc] $ sudo kill -9  42433
[19:28:11] <btullis>	 I'm trying to join the Slack huddle, but it's not showing me an icon for joining. How's it going?
[19:28:25] <razzi>	 Lets join a hangout!
[19:28:26] <ottomata>	 btullis: lets go to google hangouts
[19:28:27] <ottomata>	 we need help
[19:28:29] <ottomata>	 bc
[19:28:33] <btullis>	 OK.
[19:42:06] <wikibugs>	 10Data-Engineering, 10Product-Analytics: Investigate Superset query templating as a mean to optimize partition pruning - https://phabricator.wikimedia.org/T299961 (10JAllemandou)
[19:43:13] <wikibugs>	 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10Eevans) @BTullis: Regarding which node(s) to (re)pool (presumably tomorrow, Jan 25?)...  It shouldn't...
[19:47:52] <joal>	 ottomata: is now a good time to talk?
[19:48:25] <joal>	 woops - actually probably not
[19:49:18] <wikibugs>	 10Analytics, 10Data-Engineering, 10Product-Analytics: Investigate easier methods for WMF staff to access Superset - https://phabricator.wikimedia.org/T258962 (10Ottomata) I think to really get this fixed, someone from high up in Tech and Product need to convince SRE that this is something that needs resource...
[19:50:54] <btullis>	 !log rebooting an-test-coord1001
[19:50:55] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:51:25] <joal>	 ottomata: https://meet.google.com/kti-iybt-ekv
[19:53:27] <btullis>	 !log power cycled an-test-coord1001 from racadm
[19:53:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:23:29] <btullis>	 I think that we have to re-create the file system on an-test-coord1001. After the reboot it's still unmountable. e2fsck doesn't like it. resize2fs doesn't like it.
[20:24:28] <btullis>	 The only way forward that I could see to recover it would be to boot to a different root file system
[20:24:52] <btullis>	 - resize the root file system down again, recalim the 52 GB that was reallocated from /srv/ to /
[20:25:26] <btullis>	 - give that 52 GB back to /srv and then try to resize and fix the fle system at that point.
[20:26:10] <btullis>	 I think that this would be a long shot and unlikely to work very well anyway. So I think we should recreate the file system on /srv/
[20:28:10] <btullis>	 !log root@an-test-coord1001:~# mke2fs -t ext4 -j -m 0.5 /dev/vg0/srv
[20:28:11] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:30:02] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10BTullis)
[20:30:34] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10BTullis) Recreated the file system on `/srv/` ` root@an-test-coord1001:~# mke2fs -t ext4 -j -m 0.5 /dev/vg0/srv mke2fs 1.44.5 (15-Dec-2018) /dev/vg0/srv contains a ext4 f...
[20:35:09] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10BTullis) I have recovered the backup of mysql, such as it is: ` root@an-test-coord1001:~# cp -a /root/mysql /srv/ ` I have also created a few other directories that will...
[20:35:35] <btullis>	 !log rebooting an-test-coord1001 after recreating the /srv/file system.
[20:35:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:52:17] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10BTullis) We started MySQL with an `innodb_force_recovery = 1` in the `/etc/my.cnf` config file. Then removed it and restarted again.  Re-enabled the hive-server2, hive-me...
[20:56:26] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10ops-monitoring-bot) Icinga downtime set by razzi@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: Unmounting /srv to try to repair the filesystem `...
[20:59:46] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10Ottomata) > they depend on files that are in /srv/deployment which aren't present Needs a puppet run and then a `scap deploy -e hadoop-test -l an-test-coord1001.eqiad.wmn...
[21:04:16] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10BTullis) Thanks. We got this rather interesting message during a puppet run. {F34929371,width=70%} Rebooting as a result, since systemctl no longer returns.
[21:08:35] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (10BTullis) Had to give it another hard reboot from the console, because of the systemd abort. ` btullis@an-test-coord1001:/srv/mysql$ sudo shutdown -r now W: molly-guard: S...
[21:18:41] <btullis>	 !log btullis@deploy1002:/srv/deployment/analytics/refinery$ scap deploy -e hadoop-test -l an-test-coord1001.eqiad.wmnet
[21:18:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:18:50] <ottomata>	 hey yall am back 
[21:18:52] <ottomata>	 btullis: addshore 
[21:18:56] <ottomata>	 oops sorry add shore wrong ping
[21:18:57] <ottomata>	 razzi: 
[21:19:00] <ottomata>	 can i help in any way
[21:19:00] <ottomata>	 ?
[21:19:08] <addshore>	 :D
[21:20:42] <btullis>	 I think it's generally OK. There are a few services in icinga still not green, but they might be heading that way. Oozie says it's running, but I'm not 100% sure that it's on form.
[21:22:35] <ottomata>	 ok will take a look, 
[21:23:05] <btullis>	 Thanks. The oozie service says "Active(exited)"
[21:23:15] <ottomata>	 oozie  jobs -oozie $OOZIE_URL
[21:23:16] <ottomata>	 returns
[21:23:18] <ottomata>	 so that's good
[21:23:26] <ottomata>	 looks like it has state in mysql
[21:23:49] <btullis>	 O yeah, and Icinga is green for oozie now as well.
[21:24:23] <ottomata>	 gobblin looks okay eh?
[21:24:24] <ottomata>	 that's good.
[21:24:29] <btullis>	 I reckon.
[21:24:58] <ottomata>	 okay we stopped the refine jobs
[21:25:01] <ottomata>	 going to start them
[21:25:48] <ottomata>	 hive databases look good i thikn
[21:25:59] <btullis>	 I did a scap deploy as suggested, by the way. The /srv/deployment/refinery and /srv/deployment/refinery-cache had already been created by puppet, so I'm not 100% sure whether the deployment was needed.
[21:26:45] <ottomata>	 yeah it might not have been, but it might have been for the git-fat files
[21:26:52] <ottomata>	 iirc i think puppet doesn't do that part (properly?)
[21:26:58] <ottomata>	 but..maybe it did
[21:27:23] <ottomata>	 oh, are the timers already started
[21:27:26] <ottomata>	 oh yeah, we didn't disable them
[21:27:28] <ottomata>	 we just stopped them
[21:27:37] <ottomata>	 so i guess they arer started on boot
[21:27:51] <btullis>	 OK. Cool. I'm still dubious about the mysql, given that it's lost about 1.4 GB of space in /var/lib/mysql but that's the best we can do, eh?
[21:27:55] <ottomata>	 yeah looks ok
[21:27:56] <ottomata>	    Active: active (waiting) since Mon 2022-01-24 21:09:32 UTC; 18min ago
[21:27:59] <ottomata>	 yeah
[21:28:03] <ottomata>	 well so far okay?
[21:28:19] <ottomata>	 maybe mostly binlogs?
[21:28:22] <btullis>	 Yeah, I think I'm going to stand down for now, if that's ok with you.
[21:28:27] <ottomata>	 yeah go ahea
[21:28:28] <ottomata>	 d
[21:28:31] <ottomata>	 thanks btullis  much appreciated
[21:28:35] <ottomata>	 i'll try to keep an eye on things
[21:28:47] <btullis>	 Cool. It's a pleasure.
[21:29:02] <ottomata>	 haha i'm not sure 'pleasure' is quite the right word :p buuuut thank you! :)
[21:29:27] <btullis>	 :)
[21:36:30] <ottomata>	 hm refines faililng on test cluster
[21:36:30] <ottomata>	 org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'event' not found;
[21:36:40] <ottomata>	 hive client can see event db fine... hm
[21:38:19] <ottomata>	 ah, but not when actually examining a table.
[21:38:58] <ottomata>	 hmmm
[21:39:01] <ottomata>	 kind of?
[21:39:04] <ottomata>	 hmm...
[21:40:06] <ottomata>	 2022-01-24T21:39:49,843 ERROR [pool-9-thread-24] bonecp.ConnectionHandle: Database access problem. Killing off this connection and all remaining connections in the connection pool. SQL State = 08S01
[21:40:09] <ottomata>	 from hive metastore
[21:40:36] <ottomata>	 when accessing with spark
[21:40:43] <ottomata>	 hive CLI does seem okay...which is very strange
[21:41:33] <ottomata>	 yeah, thrift cli causes exceptoins in metastore logs
[21:41:36] <ottomata>	 but beeline works okay.
[21:41:37] <ottomata>	 it seems
[21:42:15] <ottomata>	 going to restart hive-metastore real quick
[21:44:19] <ottomata>	 strange that hive cli and beeline work tho
[21:44:34] <ottomata>	 going to consider wiping hive metastore db....
[21:44:47] <ottomata>	 but...i'll wait until tomorrow and do with bt ullis and/or razzi ^^^
[22:12:23] <razzi>	 cool thanks for taking a look ottomata. I extended the downtime window for 24 hours a couple hours ago, so it shouldn't alert until a bit before this time tomorrow 
[22:41:03] <ottomata>	 ok gr8 thanks
[23:02:15] <wikibugs>	 (03PS2) 10Jenniferwang: Bug: T299007 Add the mediawiki_reading_depth event platform stream to the allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007)
[23:13:38] <wikibugs>	 (03CR) 10Jenniferwang: "Thanks for the review. Please see my answers in lines." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: 10Jenniferwang)
[23:48:50] <wikibugs>	 (03CR) 10Awight: Bug: T299007 Add the mediawiki_reading_depth event platform stream to the allowlist (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: 10Jenniferwang)