Writing Huge CSVs Easily and Efficiently with PySpark

I recently ran into a use case that the usual Spark CSV writer didn’t handle very well – the data I was writing had an unusual encoding, odd characters, and was really large.

I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file.

… 

 

Jupyter Notebooks with PySpark on AWS EMR

One of the biggest, most time-consuming parts of data science is analysis and experimentation. One of the most popular tools to do so in a graphical, interactive environment is Jupyter.

Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.

… 

 

Vowpal Wabbit – Ramdisk vs. EBS-Optimized SSD

Recently I started playing around with Vowpal Wabbit and various data sets. Vowpal Wabbit promises to be really fast, so much so that disk IO is one of the most common bottlenecks according to the author. I did a quick test to see if using a RAM disk would make Vowpal Wabbit’s training faster. However, a RAM disk is not a silver bullet that will make Vowpal Wabbit faster, at least in my quick testing.

… 

 

Amazon Dash Button Hackery with Python

I bought a few Amazon Dash Buttons as part of Prime Day. These are the cheaper $4.99 buttons, not the more expensive $19.99 AWS IoT Buttons. In this blog post I’ll walk through how to make these cheaper buttons do what the more expensive button does.

… 

 

Jupyter Spark Integration

https://github.com/mozilla/jupyter-spark

Jupyter Notebook extension for Apache Spark integration.

Includes a progress indicator for the current Notebook cell if it invokes a Spark job. Queries the Spark UI service on the backend to get the required Spark job information.

This is really neat. No more checking another tab for job progress when running cells in a notebook!

 

Data Science from Scratch – Microreview

I recently finished reading Data Science from Scratch by Joel Grus. This book is a great introduction to data science concepts. It uses real code to demonstrate complex Python, data analytics, data science, and machine learning concepts.

I’m really glad I picked up this book as the first book I’ve read about machine learning. There was a great combination of mathematics, statistics, and real applications of machine learning algorithms.

The book starts out with a quick introduction to Python, followed by an in-depth review of all the math you need for the code to make sense.

If you’re looking for a book that’ll show you how to use Tensorflow or scikit-learn, this book is not for you. I’d recommend reading this book before diving into those. You’ll learn about the math behind popular machine learning libraries and implement basic versions of some of the most popular algorithms from scratch.

I think the next book I’ll pick up after this one is Python Data Science Handbook which will go into more detail on using a bunch of Python libraries to do some of this machine learning for me.

 

Jekyll in Docker

Recently I’ve been playing around with Jekyll to create some simple websites. I’ve used Jekyll in the past and I remember that the set-up was a multi-step process.

Jekyll is a Ruby application that uses several Gems and Bundler. That means installing several dependencies. In my case I don’t have a Ruby development environment already set up, so I would have to install all these packages just to use a static site generator.

Then I found the official Jekyll Docker image.

I already have Docker installed to play around with other containers, so downloading a Jekyll container and using it was as easy as:

docker run --rm --label=jekyll --volume=$(pwd):/srv/jekyll \
  -it -p 127.0.0.1:4000:4000 jekyll/jekyll jekyll serve

That’s all there is to it. This command will download the latest Jekyll image and start serving your site. No need to install Ruby, Gem, Bundler, or a bunch of other dependencies.

 

Fluent Python – Microreview

If you really want to get into the details of Python and learn about how the language was built and how some of its internals are implemented, Fluent Python is the book for you.

It’s a great book to refresh your knowledge of coroutines, asyncio, and other Python goodies.

 

Spark and Kinesis Optimization

You’ll definitely want to read this if you’re using AWS Kinesis with Apache Spark to stream data, it’s been extremely valuable:

https://aws.amazon.com/blogs/big-data/optimize-spark-streaming-to-efficiently-process-amazon-kinesis-streams/

 

Flask Web Development – Miguel Grinberg Microreview

If you’re just getting started with Flask or you want to learn about the innards of Django (yep, that’s right), “Flask Web Development” is the perfect place to start. This book dives right in with creating a full web application, including Jinja templates, authentication, building a REST API, forms, databases, security, and deployment to Heroku using Git. This book will get you up and running with Flask and then quickly go into detail on how to build a full web application.

However, in my opinion, Flask should be used for small applications, but this book goes into full detail about creating a half-Django for a full web application.

With that in mind, this book is great for learning about Django – how would you implement CSRF token checks? How would you set up database migrations from scratch? How would you handle forms? Django does all of that, but hides it all from developers. This book goes into full detail reimplementing a lot of what Django gives you out-of-the-box, which is great.

Overall I highly recommend “Flask Web Development” if you’re learning either Flask, Django, or just web-backend development in general. Don’t just use what Django gives you out of the box and ignore how it’s implemented. This book will answer questions like “Why does my Django app need a SECRET_KEY? What is this CSRF error I keep seeing? How do database migrations work? How do I write my own mail handler?”, making you a better Django developer.

Get it here: http://a.co/73ERCK9