Recently I started playing around with Vowpal Wabbit and various data sets. Vowpal Wabbit promises to be really fast; according to its author, disk I/O is one of the most common bottlenecks. I ran a quick test to see whether training from a RAM disk would make Vowpal Wabbit faster. At least in my quick testing, it didn't: a RAM disk is not a silver bullet.
I used an AWS m4.2xlarge machine with 32GB RAM and created a 20GB RAM disk. I trained a logistic regression model on data from the Criteo Display Advertising Challenge.
I expected the RAM disk to be faster because, according to answers on StackOverflow, reading from RAM is roughly 10-20 times faster than reading from an SSD.
Vowpal Wabbit is a machine learning library, comparable to Spark's MLlib or scikit-learn.
Creating a RAM Disk in Ubuntu
This was easier than expected; there's just one command to run:
sudo mount -t tmpfs -o size=20G tmpfs ramdisk/
This creates a 20GB tmpfs-backed RAM disk and mounts it at the "ramdisk" directory. I copied my training dataset into this directory. I also kept a copy of the training data on an EBS-optimized SSD attached to my instance.
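As an aside not from the original setup: if you just want to experiment before creating a dedicated mount, most Linux distributions already ship a tmpfs mount at /dev/shm, so no sudo is needed at all.

```shell
# /dev/shm is typically a tmpfs mount on Linux, usable as a ready-made RAM disk.
df -T /dev/shm            # the "Type" column should read tmpfs
# cp train.vw /dev/shm/   # then point vw at /dev/shm/train.vw
```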
Long story short, the disk wasn't the bottleneck when training a logistic regression model on this dataset. Training on a file with more than 40 million rows and roughly 30 features per row took about 3 minutes whether the data came from the SSD or from the RAM disk. The exact command (very basic) was:
vw train.vw -f model.vw --loss_function logistic
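If you want to check how fast reads are from a given location independently of vw, you can time a full sequential read of a file there. This is a generic sketch using a throwaway file, not the dataset from the post; substitute your own copies on the SSD and the RAM disk.

```shell
# Create a ~10 MB throwaway file on tmpfs, then time reading it end to end.
head -c 10000000 /dev/zero > /dev/shm/demo.bin
time wc -c /dev/shm/demo.bin   # wc -c reads the whole file; "real" is wall-clock time
rm /dev/shm/demo.bin
```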
One really obvious bottleneck here is the CPU: the command above only uses one CPU core! But that's a blog post for another day.
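A quick way to check whether a process was limited to one core is the `time` output itself: for a single-threaded, CPU-bound run, "user" time is close to "real" (wall-clock) time, while a multi-threaded run shows user time well above real. A minimal illustration with a shell busy loop (not vw itself):

```shell
# Single-threaded CPU-bound loop: expect "user" to be roughly equal to "real".
time sh -c 'i=0; while [ "$i" -lt 200000 ]; do i=$((i+1)); done'
```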