Jupyter Notebook extension for Apache Spark integration.
It includes a progress indicator for the current notebook cell when that cell invokes a Spark job, querying the Spark UI service on the backend for the required Spark job information.
This is really neat. No more checking another tab for job progress when running cells in a notebook!
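For context, Spark publishes job status through the monitoring REST API on its UI port, which is the kind of information an extension like this polls. Here's a rough sketch of the idea; the endpoint path and field names match Spark's documented API, but the helper functions and sample response are my own:

```python
# Sketch of how such an extension can get job info: Spark exposes a
# monitoring REST API on the UI port. The endpoint below matches Spark's
# documented API; the helper names and sample response are mine.

def jobs_endpoint(ui_base, app_id):
    # Job status lives at /api/v1/applications/<app-id>/jobs on the Spark UI.
    return f"{ui_base}/api/v1/applications/{app_id}/jobs"

def progress(jobs):
    """Fraction of tasks completed across the given job records."""
    done = sum(j["numCompletedTasks"] for j in jobs)
    total = sum(j["numTasks"] for j in jobs)
    return done / total if total else 1.0

# Against a live Spark UI you would fetch real records, e.g.:
#   import json; from urllib.request import urlopen
#   jobs = json.load(urlopen(jobs_endpoint("http://localhost:4040", app_id)))
sample = [{"numTasks": 10, "numCompletedTasks": 5}]
print(progress(sample))  # 0.5
```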
I recently finished reading Data Science from Scratch by Joel Grus. This book is a great introduction to data science concepts. It uses real code to demonstrate complex Python, data analytics, data science, and machine learning concepts.
I’m really glad I picked up this book as the first book I’ve read about machine learning. There was a great combination of mathematics, statistics, and real applications of machine learning algorithms.
The book starts out with a quick introduction to Python, followed by an in-depth review of all the math you need for the code to make sense.
If you’re looking for a book that’ll show you how to use TensorFlow or scikit-learn, this book is not for you. I’d recommend reading this book before diving into those. You’ll learn about the math behind popular machine learning libraries and implement basic versions of some of the most popular algorithms from scratch.
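To give a taste of that from-scratch approach, here's a minimal gradient-descent sketch of my own (not code from the book) that fits a slope to data using nothing beyond the standard library:

```python
# Minimal gradient descent for fitting y = w * x, in the from-scratch
# spirit of the book (my own illustration, not the book's code).

def gradient_descent(xs, ys, lr=0.01, steps=1000):
    """Fit a slope w minimizing the mean squared error of y ~ w * x."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Data generated by y = 3x; gradient descent should recover w close to 3.
xs = [1, 2, 3, 4, 5]
ys = [3, 6, 9, 12, 15]
print(round(gradient_descent(xs, ys), 2))  # 3.0
```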
I think the next book I’ll pick up after this one is Python Data Science Handbook which will go into more detail on using a bunch of Python libraries to do some of this machine learning for me.
Recently I’ve been playing around with Jekyll to create some simple websites. I’ve used Jekyll in the past and I remember that the set-up was a multi-step process.
Jekyll is a Ruby application that depends on several gems and Bundler. That means installing several dependencies. In my case I don’t have a Ruby development environment already set up, so I would have to install all these packages just to use a static site generator.
Then I found the official Jekyll Docker image.
I already have Docker installed to play around with other containers, so downloading a Jekyll container and using it was as easy as:
docker run --rm --label=jekyll --volume=$(pwd):/srv/jekyll \
-it -p 127.0.0.1:4000:4000 jekyll/jekyll jekyll serve
That’s all there is to it. This command will download the latest Jekyll image and start serving your site. No need to install Ruby, RubyGems, Bundler, or a bunch of other dependencies.
If you really want to get into the details of Python and learn about how the language was built and how some of its internals are implemented, Fluent Python is the book for you.
It’s a great book to refresh your knowledge of coroutines, asyncio, and other Python goodies.
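As a quick refresher on that style, here's a tiny asyncio sketch of my own (not an excerpt from the book) showing two coroutines running concurrently:

```python
# A tiny asyncio sketch: run two coroutines concurrently with gather.
import asyncio

async def fetch(name, delay):
    # Stand-in for real async I/O (a network call, a DB query, etc.).
    await asyncio.sleep(delay)
    return name

async def main():
    # gather runs both coroutines concurrently and preserves argument order,
    # so total runtime is roughly max(delay), not the sum.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.05))

print(asyncio.run(main()))  # ['a', 'b']
```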
If you’re using AWS Kinesis with Apache Spark to stream data, you’ll definitely want to read this; it’s been extremely valuable:
If you’re just getting started with Flask or you want to learn about the innards of Django (yep, that’s right), “Flask Web Development” is the perfect place to start. This book dives right in with creating a full web application, including Jinja templates, authentication, building a REST API, forms, databases, security, and deployment to Heroku using Git. This book will get you up and running with Flask and then quickly go into detail on how to build a full web application.
However, in my opinion Flask is best suited for small applications, yet this book goes into full detail on building what amounts to half of Django for a full web application.
With that in mind, this book is great for learning about Django – how would you implement CSRF token checks? How would you set up database migrations from scratch? How would you handle forms? Django does all of that, but hides it all from developers. This book goes into full detail reimplementing a lot of what Django gives you out-of-the-box, which is great.
Overall I highly recommend “Flask Web Development” if you’re learning Flask, Django, or web-backend development in general. Don’t just use what Django gives you out of the box and ignore how it’s implemented. This book will answer questions like “Why does my Django app need a SECRET_KEY? What is this CSRF error I keep seeing? How do database migrations work? How do I write my own mail handler?”, making you a better Django developer.
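To give a flavor of those first two questions: a from-scratch CSRF token check can be sketched with just the standard library. This is my own illustration of the general technique, not code from the book:

```python
# A from-scratch sketch of signed CSRF tokens, in the spirit of the checks
# this book reimplements (my own illustration, not the book's code).
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # analogous to Flask's SECRET_KEY

def generate_csrf_token(session_id: str) -> str:
    # Sign the session id so a token can't be forged without the key.
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def validate_csrf_token(session_id: str, token: str) -> bool:
    expected = generate_csrf_token(session_id)
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(expected, token)

token = generate_csrf_token("session-123")
print(validate_csrf_token("session-123", token))  # True
print(validate_csrf_token("session-456", token))  # False
```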
Get it here: http://a.co/73ERCK9
New tiny GitHub project: https://github.com/mikestaszel/spark_cluster_vagrant
Over the past few weeks I’ve been working on benchmarking Spark as well as learning more about setting up clusters of Spark machines both locally and on cloud providers.
I decided to work on a simple Vagrantfile that spins up a Spark cluster with a head node and however many worker nodes you want. I’ve seen a few of these before, but they either used some third-party box, shipped an older version of Spark, or only spun up one node.
By running only one command I could have a fully-configured Spark cluster ready to use and test. Vagrant also easily extends beyond simple VirtualBox machines to many providers, including AWS EC2 and DigitalOcean, and this Vagrantfile can be extended to provision clusters on those providers.
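The multi-machine pattern looks roughly like this. This is a hypothetical sketch, not the actual repo contents; the box name, IP addresses, and provisioning script names are placeholders:

```ruby
# Hypothetical sketch of a head-plus-workers Vagrantfile.
# Box, IPs, and script names are placeholders, not the real repo's values.
WORKERS = 2

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/xenial64"

  # One head node that the workers register with.
  config.vm.define "master" do |master|
    master.vm.hostname = "spark-master"
    master.vm.network "private_network", ip: "192.168.50.10"
    master.vm.provision "shell", path: "provision_master.sh"
  end

  # However many worker nodes you want, each with its own IP.
  (1..WORKERS).each do |i|
    config.vm.define "worker-#{i}" do |worker|
      worker.vm.hostname = "spark-worker-#{i}"
      worker.vm.network "private_network", ip: "192.168.50.#{10 + i}"
      worker.vm.provision "shell", path: "provision_worker.sh"
    end
  end
end
```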
Check it out here: https://github.com/mikestaszel/spark_cluster_vagrant
I just finished reading “Hello, Startup” by Yevgeniy Brikman, a book written for programmers about starting a startup. All the basics are covered, including hiring, teamwork, startup culture, and development methodology while scaling a startup. It’s a nice quick read (I skimmed through the chapters about development, programming, databases, and other technical topics), but I found the rest of the content to be a great place to start learning about what it takes to build a startup.
Check it out here (also available on Safari Books): http://www.hello-startup.net
I like to start my projects with Flask and Python because it’s quick to get going for most things, yet lightweight.
By default, Flask doesn’t give you much in terms of test frameworks, application settings, deployment, or running the application in production. I always end up making a skeleton that does some of these things, so I decided to put together a GitHub repository with a skeleton Flask project that does it for me.
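The core of such a skeleton is usually an application factory that picks its settings by name. Here's a minimal sketch of that pattern; the config classes and route are illustrative, not the actual repo contents:

```python
# Minimal sketch of a Flask skeleton's application factory with named
# configurations (class and route names are illustrative).
from flask import Flask

class Config:
    TESTING = False
    SECRET_KEY = "change-me"  # placeholder; load from the environment in production

class TestConfig(Config):
    TESTING = True

CONFIGS = {"default": Config, "test": TestConfig}

def create_app(config_name="default"):
    # The factory builds a fresh app with the requested settings,
    # which makes testing and deployment configuration much easier.
    app = Flask(__name__)
    app.config.from_object(CONFIGS[config_name])

    @app.route("/")
    def index():
        return "hello"

    return app

app = create_app("test")
client = app.test_client()
print(client.get("/").data)  # b'hello'
```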
Have a look here: https://github.com/mikestaszel/flask_startup
This weekend while running a rather large Python job, I ran into a memory error. It turned out that a dictionary I was populating could potentially become too big to fit into RAM. This is where DiskDict saved me some time.
It’s definitely not the best way to solve an issue, but in this case I was working with a limited system where rewriting the surrounding code would have been intrusive. Plus, the job didn’t have time constraints, so DiskDict was a decent workaround.
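For illustration, the standard library’s shelve module provides a similar disk-backed mapping; this is a stdlib stand-in showing the same idea, not DiskDict itself:

```python
# A disk-backed dict with the stdlib: shelve persists entries to disk
# instead of holding them in RAM (a stand-in for DiskDict, not DiskDict).
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "counts")

# Writes go to a file on disk rather than an in-memory dict.
with shelve.open(path) as db:
    db["spark"] = 42
    db["flask"] = 7

# Reopening the shelf gets the persisted values back.
with shelve.open(path) as db:
    print(db["spark"])  # 42
    print(len(db))      # 2
```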
Wanted to share because it proved useful to me!