A month ago I ran into issues with Syncthing that caused it to delete files that had been newly deleted and recreated. I thought that fixing that would solve all my issues with files randomly disappearing. Unfortunately, I’m back with another installment of “where did my files go?”
Before I get started, here are some quick introductions in case you’re unfamiliar with the standard Ruby toolbox: Rake is the Ruby programming language’s task automation tool. A
Rakefile contains a Ruby project’s tasks; which can be run in parallel threads. Rubygems is both a Ruby library and a management tool for installing other Ruby packages/libraries (“gems”).
My Rake tasks sometimes wouldn’t get off the ground due to an
Errno::ENOENT ‘No such file or directory’ error that appeared seemingly at random. Despite what the error message claimed, the files in question — never the same two times in a row — were very much present.
What made the situation even stranger was that the error messages were raised by a function inside of Rake. The function in question, used to check task file-dependencies, would first check that the file existed. Then, on the very next line, it would attempt to look up the modification-time of the file which would fail and raise the
Errno::ENOENT error. (See their implementation in file_task.rb.)
To make things worse, the problem would disappear almost completely if I introduced any delays. E.g. the problem would seemingly disappear if I tried to get debug info from Rake with
rake --trace or trace the storage system calls with
strace rake. The most reliable workaround involved relocating my project’s files from my fast NVMe SSD to a slow USB 2.0 thumb-drive.
In other words, I was dealing with a Heisenbug: a bug that seems to disappear or change when attempting to debug it. The observed behaviors also told me a whole lot about what could be going on; and none of it was good news. The problem was clearly some sort of timing-dependent race-condition. It could be a problem in just about any layer all the way down in the file system or the operating system kernel itself.
After literally thousands of runs, I’d managed to catch the bug while running with
strace and could better see what was going on. Something was calling the
chdir (change working directory) system call and was changing the Rake process’ working directory while Rake was busy checking file dependencies.
All POSIX-like operating system keeps track of each process’ working-directory on the operating-system level. Processes only have one working directory, so changing it affects all of that process’ threads. All my paths were written relative to the current working directory and nothing in my code would even attempt to change the current working directory. Ruby would raise
Errno::ENOENT errors because there indeed wasn’t any files with the specified relative paths inside the wrong working directory.
The something doing the working-directory switcheroo was walking down the directory three towards the system root directory. At each directory level, it was looking for four files:
Isolate. You may have realized, as I did at this point, that this was Rubygems looking for the project’s package-instruction file. My project didn’t need a
Gemfile so it didn’t have one either.
As mentioned earlier, Rake runs tasks in parallel threads. Threads yield execution when blocked by operations such as interacting with the file system.
So one Rake task would check and verify that a file exists and then yield the execution. The execution would be picked up by Rubygems who turned around and switched the working directory and yield execution before it had time to switch it back again. The execution would then continue in the previous or a completely different task thread; one that didn’t expect the working directory to have changed.
My code didn’t import the Rubygems library. Importing it is considered an anti-pattern even according to Rubygems’ own documentation. However, Rake does import it! I’m not going to discuss whether this is an anti-pattern or not. However, importing the Rubygems library in a multi-threaded process can definitely cause problems.
As I’d already discovered, Rubygems switches the process working-directory while looking for its own instruction files. (See their implementation in
util.rb.) Switching the working-directory in a library’s set-up process just to look for the presence of a couple of files in that directory is most definitely an anti-pattern.
It’s a common pattern in Ruby to change the working directory in a block of code and relying on Ruby to switch back at the end of the block’s execution. As discussed earlier, the current working directory is a per-process instance variable tracked by the kernel. This can’t be done in a thread-safe behavior without explicitly locking the process so the thread can run to completion.
I don’t understand why this happens. Rake loads Rubygems early in its initialization process — before my
Rakefile or code is loaded. As far as I can understand, neither Rake’s threads, the threads I create with Parallel, nor the main thread ever need to load the Rubygems library anew. I simply don’t know enough about the inner workings of Ruby to fully explain the observed behavior.
I can work around the problem by creating an empty
Gemfile next to my
Rakefile. Rubygems finds the file its looking for immediately and doesn’t begin its directory-walking journey down the directory hierarchy.