Tom's Blog

My 2025 Year in Books

Sun, 28 Dec 2025 15:00:24 -0600

It’s time for another review for books I read this year (previously: 2024, 2022). According to my GoodReads, I read 27 books this year. Here are some highlights:

The Demon-Haunted World

I started the year with Carl Sagan’s The Demon-Haunted World, as some kind of antidote to current / coming events. I last read this in about 2010, and held it in very high regard. I still do, but Carl comes off as a bit of a fuddy duddy at times (especially when talking about “the youth” today / in the 1990s). That’s not to say that he’s wrong about where society has gone (quiet the opposite), but it as a kind of tone. If you’re interested in an introduction to skepticism, I’d probably recommend the Skeptic’s Guide to the Universe.

GPU-Accelerated Zarr

Thu, 11 Dec 2025 08:00:00 -0600

This post gives detailed background to my PyData Global talk, “GPU-Accelerated Zarr” (slides, video). It deliberately gets into the weeds, but I will try to provide some background for people who are new to Zarr, GPUs, or both.

The first takeaway is that zarr-python natively supports NVIDIA GPUs. With a one-line zarr.config.enable_gpu() you can configure zarr to return CuPy arrays, which reside on your GPU:

>>> import zarr
>>> zarr.config.enable_gpu()
>>> z = zarr.open_array("path/to/store.zarr", mode="r")
>>> type(z[:])
cupy.ndarray

The second takeaway, and the main focus of this post, is that that simple one-liner leaves performance on the table. It depends a bit on your workload, but I’d claim that Zarr’s data loading pipeline shouldn’t ever be the bottleneck. Achieving maximum throughput today requires some care to ensure that the system’s resources are used efficiently. I’m hopeful that we can improve the libraries to do the right thing in more situations.

Gone Sailing

Sat, 18 Oct 2025 07:40:38 -0500

Last weekend I had the chance to sail in the 2025 Corn Coast Regatta. I had such a great time that I had to jot down my thoughts before they fade. This post is mostly for (future) me. We’ll return to our regularly scheduled programming in a future post. I have a post on Zarr performance cooking.

First, some context: in August I attended the Saylorville Yatch Club Sailing School Adult Small Boat class. This is a 3-day course that mixes some time in the classroom learning the theory and jargon (so much jargon!) with a bunch of time on the water. I had a bit of experience from sailing on summer weekends with my family growing up, but I wanted to learn more before going out on my own.

Cloud Native Geospatial Conference (2025)

Sun, 04 May 2025 00:00:00 -0600

You can watch a video version of this talk at https://youtu.be/BFFHXNBj7nA

On Thursday, I presented a talk, GPU Accelerated Cloud-Native Geospatial, at the inaugural Cloud-Native Geospatial Conference (slides here). This post will give an overview of the talk and some background on the prep. But first I wanted to say a bit about the conference itself.

The organizers (Michelle Roby, Jed Sundell, and others from Radiant Earth) did a fantastic job putting on the event. I only have the smallest experience with helping run a conference, but I know it’s a ton of work. They did a great job hosting this first run of conference.

High Performance Remote IO

Fri, 28 Feb 2025 15:18:34 -0600

I have a new post up at the NVIDIA technical blog on High-Performance Remote IO with NVIDIA KvikIO.¹

This is mostly general-purpose advice on getting good performance out of cloud object stores (I guess I can’t get away from them), but has some specifics for people using NVIDIA GPUs.

In the RAPIDS context, NVIDIA KvikIO is notable because

It automatically chunks large requests into multiple smaller ones and makes those requests concurrently.

It can read efficiently into host or device memory, especially if GPU Direct Storage is enabled.

It’s fast.

As part of preparing this, I got to write some C++. Not a fan!

Iowa's Proposed State Science Standards

Sat, 01 Feb 2025 12:00:00 -0600

My local Department of Education has a public comment period for some proposed changes to Iowa’s science education standards. If you live in Iowa, I’d encourage you to read the proposal (PDF) and share feedback through the survey. If you, like me, get frustrated with how difficult it is to see what’s changed or link to a specific piece of text, read on.

I’d heard rumblings that there were some controversial changes around evolution and climate change. But rather than just believing what I read in a headline, I decided to do my own research (science in action, right?).

Dask Geopandas Spatial Shuffle

Wed, 18 Dec 2024 10:33:34 -0600

Over at https://github.com/opengeospatial/geoparquet/discussions/251, we’re having a nice discussion about how best to partition geoparquet files for serving over object storage. Thanks to geoparquet’s design, just being an extension of parquet, it immediately benefits from all the wisdom around how best to partition plain parquet datasets. The only additional wrinkle for geoparquet is, unsurprisingly, the geo component.

It’s pretty common for users to read all the features in a small spatial area (a city, say) so optimizing for that use case is a good default. Simplifying a bit, reading small spatial subsets of a larger dataset will be fastest if all the features that are geographically close together are also “close” together in the parquet dataset, and each part of the parquet dataset only contains data that’s physically close together. That gives you the data you want in the fewest number of file reads / HTTP requests, and minimizes the amount of “wasted” reads (data that’s read, only to be immediately discarded because it’s outside your area of interest).

My 2024 Year in Books

Wed, 11 Dec 2024 08:00:00 -0600

Here’s another Year in Books (I missed last year, but here’s 2022).

Most of these came from recommendations by friends, The Incomparable’s Book Club and (a new source), the “Books in the Box” episodes of Oxide and Friends.

The Soul of a New Machine, by Tracy Kidder

I technically read it in the last few days of 2023, but included here because I liked it so much. This came recommended by the Oxide and Friends podcast’s Books in the Box episode. I didn’t know a ton about the history of computing, but have been picking up an appreciation for it thanks to reading this book. It goes into a ton of detail about what it took Data General to design and release a new machine. Highly recommended to anyone interested in computing.

Serializing Dataclasses

Sat, 31 Aug 2024 12:00:00 -0500

This post is a bit of a tutorial on serializing and deserializing Python dataclasses. I’ve been hacking on zarr-python-v3 a bit, which uses some dataclasses to represent some metadata objects. Those objects need to be serialized to and deserialized from JSON.

This is a (surprisingly?) challenging area, and there are several excellent libraries out there that you should probably use. My personal favorite is msgspec, but cattrs, pydantic, and pyserde are also options. But hopefully this can be helpful for understanding how those libraries work at a conceptual level (their exact implementations will look very different.) In zarr-python’s case, this didn’t quite warrant needing to bring in a dependency, so we rolled our own.

stac-geoparquet

Thu, 29 Aug 2024 20:00:00 -0500

I wrote up a quick introduction to stac-geoparquet on the Cloud Native Geo blog with Kyle Barron and Chris Holmes.

The key takeaway:

STAC GeoParquet offers a very convenient and high-performance way to distribute large STAC collections, provided the items in that collection are pretty homogenous

Check out the project at http://github.com/stac-utils/stac-geoparquet.

What's Next? (2024 edition)

Mon, 12 Aug 2024 07:00:00 -0500

I have, as they say, some personal news to share. On Monday I (along with some very talented teammates, see below if you’re hiring) was laid off from Microsoft as part of a reorganization. Like my Moving to Microsoft post, I wanted to jot down some of the things I got to work on.

For those of you wondering, the Planetary Computer project does continue, just without me.

Reflections

It should go without saying that all of this was a team effort. I’ve been incredibly fortunate to have great teammates over the years, but the team building out the Planetary Computer was especially fantastic. Just like before, this will be very self-centered and project-focused, overlooking all the other people and work that went into this.

My Real-World Match / Case

Wed, 13 Dec 2023 21:00:00 -0600

Ned Batchelder recently shared Real-world match/case, showing a real example of Python’s Structural Pattern Matching. These real-world examples are a great complement to the tutorial, so I’ll share mine.

While working on some STAC + Kerchunk stuff, in this pull request I used the match statement to parse some nested objects:

for k, v in refs.items():
 match k.split("/"):
 case [".zgroup"]:
 # k = ".zgroup"
 item.properties["kerchunk:zgroup"] = json.loads(v)
 case [".zattrs"]:
 # k = ".zattrs"
 item.properties["kerchunk:zattrs"] = json.loads(v)
 case [variable, ".zarray"]:
 # k = "prcp/.zarray"
 if u := item.properties["cube:dimensions"].get(variable):
 u["kerchunk:zarray"] = json.loads(refs[k])
 elif u := item.properties["cube:variables"].get(variable):
 u["kerchunk:zarray"] = json.loads(refs[k])
 case [variable, ".zattrs"]:
 # k = "prcp/.zattrs"
 if u := item.properties["cube:dimensions"].get(variable):
 u["kerchunk:zattrs"] = json.loads(refs[k])
 elif u := item.properties["cube:variables"].get(variable):
 u["kerchunk:zattrs"] = json.loads(refs[k])
 case [variable, index]:
 # k = "prcp/0.0.0"
 if u := item.properties["cube:dimensions"].get(variable):
 u.setdefault("kerchunk:value", collections.defaultdict(dict))
 u["kerchunk:value"][index] = refs[k]
 elif u := item.properties["cube:variables"].get(variable):
 u.setdefault("kerchunk:value", collections.defaultdict(dict))
 u["kerchunk:value"][index] = refs[k]

The for loop is iterating over a set of Kerchunk references, which are essentially the keys for a Zarr group. The keys vary a bit. They could be:

STAC Updates I'm Excited About

Sun, 15 Oct 2023 12:00:00 -0500

I wanted to share an update on a couple of developments in the STAC ecosystem that I’m excited about. It’s a great sign that even after 2 years after its initial release, the STAC ecosystem is still growing and improving how we can catalog, serve, and access geospatial data.

STAC and Geoparquet

A STAC API is a great way to query for data. But, like any API serving JSON, its throughput is limited. So in May 2022, the Planetary Computer team decided to export snapshots of our STAC database as geoparquet. Each STAC collection is exported as a Parquet dataset, where each record in the dataset is a STAC item. We pitched this as a way to do bulk queries over the data, where returning many and many pages of JSON would be slow (and expensive for our servers and database).

Gone Rafting

Sun, 13 Aug 2023 14:30:19 -0500

Last week, I was fortunate to attend Dave Beazley’s Rafting Trip course. The pretext of the course is to implement the Raft Consensus Algorithm.

I’ll post more about Raft, and the journey of implementing, it later. But in brief, Raft is an algorithm that lets a cluster of machines work together to reliably do something. If you had a service that needed to stay up (and stay consistent), even if some of the machines in the cluster went down, then you might want to use Raft.

National Water Model on Azure

Thu, 25 May 2023 12:04:06 -0500

A few colleagues and I recently presented at the CIROH Training and Developers Conference. In preparation for that I created a Jupyter Book. You can view it at https://tomaugspurger.net/noaa-nwm/intro.html I created a few cloud-optimized versions for subsets of the data, but those will be going away since we don’t have operational pipelines to keep them up to date. But hopefully the static notebooks are still helpful.

Lessons learned

Aside from running out of time (I always prepare too much material for the amount of time), I think things went well. JupyterHub (perhaps + Dask) and Kubernetes continues to be a great way to run a workshop.

Jupyter, STAC, and Tool Building

Sat, 15 Apr 2023 08:00:00 -0500

Over in Planetary Computer land, we’re working on bringing Sentinel-5P into our STAC catalog.

STAC items require a geometry property, a GeoJSON object that describes the footprint of the assets. Thanks to the satellites’ orbit and the (spatial) size of the assets, we started with some…interesting… footprints:

That initial footprint, shown in orange, would render the STAC collection essentially useless for spatial searches. The assets don’t actually cover (most of) the southern hemisphere.

py-spy in Azure Batch

Wed, 22 Feb 2023 15:11:37 -0600

Today, I was debugging a hanging task in Azure Batch. This short post records how I used py-spy to investigate the problem.

Background

Azure Batch is a compute service that we use to run container workloads. In this case, we start up a container that processes a bunch of GOES-GLM data to create STAC items for the Planetary Computer . The workflow is essentially a big

for url in urls:
 local_file = download_url(url)
 stac.create_item(local_file)

We noticed that some Azure Batch tasks were hanging. Based on our logs, we knew it was somewhere in that for loop, but couldn’t determine exactly where things were hanging. The goes-glm stactools package we used does read a NetCDF file, and my experience with Dask biased me towards thinking the netcdf library (or the HDF5 reader it uses) was hanging. But I wanted to confirm that before trying to implement a fix.

Planetary Computer Release: January 2023

Thu, 09 Feb 2023 00:00:00 +0000

The Planetary Computer made its January 2023 release a couple weeks back.

The flagship new feature is a really cool new ability to visualize the Microsoft AI-detected Buildings Footprints dataset. Here’s a little demo made by my teammate, Rob:

Your browser doesn't support HTML video. Here is a link to the video instead.

Currently, enabling this feature required converting the data from its native geoparquet to a lot of protobuf files with Tippecanoe. I’m very excited about projects to visualize the geoparquet data directly (see Kyle Barron’s demo) but for now we needed to do the conversion.

Cloud Optimized Vibes

Sat, 14 Jan 2023 16:16:11 -0600

Over on the Planetary Computer team, we get to have a lot of fun discussions about doing geospatial data analysis on the cloud. This post summarizes some work we did, and the (I think) interesting conversations that came out of it.

Background: GOES-GLM

The instigator in this case was onboarding a new dataset to the Planetary Computer, GOES-GLM. GOES is a set of geostationary weather satellites operated by NOAA, and GLM is the Geostationary Lightning Mapper, an instrument on the satellites that’s used to monitor lightning. It produces some really neat (and valuable) data.

Queues in the News

Mon, 26 Dec 2022 13:35:24 -0600

I came across a couple of new (to me) uses of queues recently. When I came up with the title to this article I knew I had to write them up together.

Queues in Dask

Over at the Coiled Blog, Gabe Joseph has a nice post summarizing a huge amount of effort addressing a problem that’s been vexing demanding Dask users for years. The main symptom of the problem was unexpectedly high memory usage on workers, leading to crashing workers (which in turn caused even more network communication, and so more memory usage, and more crashing workers). This is actually a problem I worked on a bit back in 2019, and I made very little progress.

My 2022 Year in Books

Wed, 21 Dec 2022 07:25:47 -0600

It’s “Year in X” time, and here’s my 2022 Year in Books on GoodReads. I’ll cover some highlights here.

Many of these recommendations came from the Incomparable’s Book Club, part of the main Incomparable podcast. In particular, episode 600 The Machine was a Vampire which is a roundup of their favorites from the 2010s.

Bookended by Murderbot Diaries

I started and ended this year (so far) with a couple installments in the Murderbot Diaries. These follow a robotic / organic “Security Unit” that’s responsible for taking care of humans in dangerous situations. We pick up after an unfortunate incident where it seems to have gone rouge and murdered her clients (hence, the murderbot) and hacked its governor module to essentially become “free”.

Podcast: Revolutions

Tue, 20 Dec 2022 16:56:57 -0600

Mike Duncan is wrapping up his excellent Revolutions podcast. If you’re at all interested in history then now is a great time to pick it up. He takes the concept of “a revolution” and looks at it through the lens of a bunch of revolutions throughout history. The appendix episodes from the last few weeks have really tied things together, looking at whats common (and not) across all the revolutions covered in the series.

Rebooting

Sun, 18 Dec 2022 16:51:46 -0600

Like some others, I’m getting back into blogging.

I’ll be “straying from my lane” and won’t just be writing about Python data libraries (though there will still be some of that). If you too would like to blog more, I’d encourge you to read Simon Willison’s What to blog About and Matt Rocklin’s Write Short Blogposts.

Because I’m me, I couldn’t just make a new post. I also had to switch static site generators, just becauase. All the old links, including my RSS feed, should continue to work. If you spot any issues, [email protected]">let me know (I think I’ve fixed at least one bug in the RSS feed, apologies for any spurious updates. But just in case, you might want to update your RSS links to http://tomaugspurger.net/index.xml).

What's Next?

Wed, 11 Nov 2020 00:00:00 +0000

Some personal news: Last Friday was my last day at Anaconda. Next week, I’m joining Microsoft’s AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and all the great people there, I’m extremely excited about what I’ll be working on at Microsoft.

Reflections

I was inspired to write this section by Jim Crist’s post on a similar topic: https://jcristharif.com/farewell-to-anaconda.html. I’ll highlight some of the projects I worked on while at Anaconda. If you want to skip the navel gazing, skip down to what’s next.

Maintaining Performance

Wed, 01 Apr 2020 00:00:00 +0000

As pandas’ documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases. This post describes

pandas’ current setup for monitoring performance
My personal debugging strategy for understanding and fixing performance regressions when they occur.

I hope that the first section topic is useful for library maintainers and the second topic is generally useful for people writing performance-sensitive code.

Compatibility Code

Thu, 12 Dec 2019 00:00:00 +0000

Compatibility Code

Most libraries with dependencies will want to support multiple versions of that dependency. But supporting old version is a pain: it requires compatibility code, code that is around solely to get the same output from versions of a library. This post gives some advice on writing compatibility code.

Don’t write your own version parser
Centralize all version parsing
Use consistent version comparisons
Use Python’s argument unpacking
Clean up unused compatibility code

1. Don’t write your own version parser

It can be tempting just do something like

Dask Workshop

Thu, 12 Dec 2019 00:00:00 +0000

Dask Summit Recap

Last week was the first Dask Developer Workshop. This brought together many of the core Dask developers and its heavy users to discuss the project. I want to share some of the experience with those who weren’t able to attend.

This was a great event. Aside from any technical discussions, it was ncie to meet all the people. From new acquaintences to people you’re on weekly calls with, it was great to interact with everyone.

pandas + binder

Sun, 21 Jul 2019 00:00:00 +0000

This post describes the start of a journey to get pandas’ documentation running on Binder. The end result is this nice button:

For a while now I’ve been jealous of Dask’s examples repository. That’s a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is both viewable as a static site at examples.dask.org, and as a executable notebooks on mybinder.

A Confluence of Extension

Tue, 18 Jun 2019 00:00:00 +0000

This post describes a few protocols taking shape in the scientific Python community. On their own, each is powerful. Together, I think they enable for an explosion of creativity in the community.

Each of the protocols / interfaces we’ll consider deal with extending.

First, a bit of brief background on each.

NEP-13 and NEP-18, each deal with using the NumPy API on non-NumPy ndarray objects. For example, you might want to apply a ufunc like np.log to a Dask array.

Tabular Data in Scikit-Learn and Dask-ML

Mon, 17 Sep 2018 00:00:00 +0000

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We’ll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays and DataFrames.

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import seaborn as sns
import fastparquet
from distributed import Client
from distributed.utils import format_bytes

Background

For the most part, Scikit-Learn uses NumPy ndarrays or SciPy sparse matricies for its in-memory data structures. This is great for many reasons, but one major drawback is that you can’t store heterogenous (AKA tabular) data in these containers. These are datasets where different columns of the table have different data types (some ints, some floats, some strings, etc.).

Distributed Auto-ML with TPOT with Dask

Thu, 30 Aug 2018 00:00:00 +0000

This work is supported by Anaconda Inc.

This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose new models to try out in the next generation.

Parallelizing TPOT

In TPOT-730, we made some modifications to TPOT to support distributed training. As a TPOT user, the only changes you need to make to your code are

Moral Philosophy for pandas or: What is `.values`?

Tue, 14 Aug 2018 00:00:00 +0000

The other day, I put up a Twitter poll asking a simple question: What’s the type of series.values?

Pop Quiz! What are the possible results for the following:

>>> type(pandas.Series.values)
— Tom Augspurger (@TomAugspurger) August 6, 2018

I was a bit limited for space, so I’ll expand on the options here. Choose as many as you want.

NumPy ndarray
pandas Categorical (or all of the above)
An Index or any of it’s subclasses (DatetimeIndex, CategoricalIndex, RangeIndex, etc.) (or all of the above)
None or all of the above

I was prompted to write this post because a.) this is an (unfortunately) confusing topic and b.) it’s undergoing a lot of change right now (and, c.) I had this awesome title in my head).

Modern Pandas (Part 8): Scaling

Mon, 23 Apr 2018 00:00:00 +0000

This is part 1 in my series on writing modern idiomatic pandas.

As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in tension with the fact that a pandas DataFrame is an in memory container. You can’t have a DataFrame larger than your machine’s RAM. In practice, your available RAM should be several times the size of your dataset, as you or pandas will have to make intermediate copies as part of the analysis.

dask-ml 0.4.1 Released

Tue, 13 Feb 2018 00:00:00 +0000

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation.

dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0.

Conda packages are available on conda-forge

$ conda install -c conda-forge dask-ml

and wheels and the source are available on PyPI

$ pip install dask-ml

I wanted to highlight one change, that touches on a topic I mentioned in my first post on scalable Machine Learning. I discussed how, in my limited experience, a common workflow was to train on a small batch of data and predict for a much larger set of data. The training data easily fits in memory on a single machine, but the full dataset does not.

Extension Arrays for Pandas

Mon, 12 Feb 2018 00:00:00 +0000

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren’t a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we’d like to open that up to anybody.

A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP Address data that they’d like to work with in pandas. They didn’t just want to make a NumPy array of IP addresses for a few reasons:

Easy distributed training with Joblib and dask

Mon, 05 Feb 2018 00:00:00 +0000

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation.

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. This article will talk about some improvements we made to improve training scikit-learn models using a cluster.

Rewriting scikit-learn for big data, in under 9 hours.

Sun, 28 Jan 2018 00:00:00 +0000

Towards the end of our week, Gael threw out the observation that for many applications, you don’t need to train on the entire dataset, a sample is often sufficient. But it’d be nice if the trained estimator would be able to transform and predict for dask arrays, getting all the nice distributed parallelism and memory management dask brings.

dask-ml

Thu, 26 Oct 2017 00:00:00 +0000

Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with

pip install dask-ml

Packages are currently building for conda-forge, and will be up later today.

conda install -c conda-forge dask-ml

The Goals

dask is, to quote the docs, “a flexible parallel computing library for analytic computing.” dask.array and dask.dataframe have done a great job scaling NumPy arrays and pandas dataframes; dask-ml hopes to do the same in the machine learning domain.

Scalable Machine Learning (Part 3): Parallel

Sat, 16 Sep 2017 00:00:00 +0000

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

This is part three of my series on scalable machine learning.

You can download a notebook of this post [here][notebook].

In part one, I talked about the type of constraints that push us to parallelize or distribute a machine learning workload. Today, we’ll be talking about the second constraint, “I’m constrained by time, and would like to fit more models at once, by using all the cores of my laptop, or all the machines in my cluster”.

Scalable Machine Learning (Part 2): Partial Fit

Fri, 15 Sep 2017 00:00:00 +0000

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

This is part two of my series on scalable machine learning.

You can download a notebook of this post here.

Scikit-learn supports out-of-core learning (fitting a model on a dataset that doesn’t fit in RAM), through it’s partial_fit API. See here.

The basic idea is that, for certain estimators, learning can be done in batches. The estimator will see a batch, and then incrementally update whatever it’s learning (the coefficients, for example). This link has a list of the algorithms that implement partial_fit.

Scalable Machine Learning (Part 1)

Mon, 11 Sep 2017 00:00:00 +0000

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

Anaconda is interested in scaling the scientific python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available today, and track the community’s efforts to push the boundaries.

You can download a Jupyter notebook demonstrating the analysis here.

Constraints

I am (or was, anyway) an economist, and economists like to think in terms of constraints. How are we constrained by scale? The two main ones I can think of are

Dask Performace Trip

Tue, 06 Sep 2016 00:00:00 +0000

I’m faced with a fairly specific problem: Compute the pairwise distances between two matrices $X$ and $Y$ as quickly as possible. We’ll assume that $Y$ is fairly small, but $X$ may not fit in memory. This post tracks my progress.

Introducing Stitch

Tue, 30 Aug 2016 00:00:00 +0000

Today I released stitch into the wild. If you haven’t yet, check out the examples page to see an example of what stitch does, and the Github repo for how to install. I’m using this post to explain why I wrote stitch, and some issues it tries to solve.

Why knitr / knitpy / stitch / RMarkdown?

Each of these tools or formats have the same high-level goal: produce reproducible, dynamic (to changes in the data) reports. They take some source document (typically markdown) that’s a mixture of text and code and convert it to a destination output (HTML, PDF, docx, etc.).

Modern Pandas (Part 7): Timeseries

Fri, 13 May 2016 00:00:00 +0000

This is part 7 in my series on writing modern idiomatic pandas.

Timeseries

Pandas started out in the financial world, so naturally it has strong timeseries support.

The first half of this post will look at pandas’ capabilities for manipulating time series data. The second half will discuss modelling time series data with statsmodels.

%matplotlib inline

import os
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='ticks', context='talk')

if int(os.environ.get("MODERN_PANDAS_EPUB", 0)):
 import prep # noqa

Let’s grab some stock data for Goldman Sachs using the pandas-datareader package, which spun off of pandas:

Modern Pandas (Part 6): Visualization

Thu, 28 Apr 2016 00:00:00 +0000

This is part 6 in my series on writing modern idiomatic pandas.

Visualization and Exploratory Analysis

A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren’t that important, but some brief background might be useful so we can transfer the takeaways to Python. The competing systems are “base R”, which is the plotting system built into the language, and ggplot2, Hadley Wickham’s implementation of the grammar of graphics. For those interested in more details, start with

Modern Pandas (Part 5): Tidy Data

Fri, 22 Apr 2016 00:00:00 +0000

This is part 5 in my series on writing modern idiomatic pandas.

Reshaping & Tidy Data

Structuring datasets to facilitate analysis (Wickham 2014)

So, you’ve sat down to analyze a new dataset. What do you do first?

In episode 11 of Not So Standard Deviations, Hilary and Roger discussed their typical approaches. I’m with Hilary on this one, you should make sure your data is tidy. Before you do any plots, filtering, transformations, summary statistics, regressions… Without a tidy dataset, you’ll be fighting your tools to get the result you need. With a tidy dataset, it’s relatively easy to do all of those.

Modern Panadas (Part 3): Indexes

Mon, 11 Apr 2016 00:00:00 +0000

This is part 3 in my series on writing modern idiomatic pandas.

Indexes can be a difficult concept to grasp at first. I suspect this is partly becuase they’re somewhat peculiar to pandas. These aren’t like the indexes put on relational database tables for performance optimizations. Rather, they’re more like the row_labels of an R DataFrame, but much more capable.

Modern Pandas (Part 4): Performance

Fri, 08 Apr 2016 00:00:00 +0000

This is part 4 in my series on writing modern idiomatic pandas.

Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas, we all benefit from his and others’ hard work. This post will focus mainly on making efficient use of pandas and NumPy.

Modern Pandas (Part 2): Method Chaining

Mon, 04 Apr 2016 00:00:00 +0000

This is part 2 in my series on writing modern idiomatic pandas.

Method Chaining

Method chaining, where you call methods on an object one after another, is in vogue at the moment. It’s always been a style of programming that’s been possible with pandas, and over the past several releases, we’ve added methods that enable even more chaining.

Modern Pandas (Part 1)

Mon, 21 Mar 2016 00:00:00 +0000

This is part 1 in my series on writing modern idiomatic pandas.

Effective Pandas

Introduction

This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It’s targeted at an intermediate level: people who have some experience with pandas, but are looking to improve.

Prior Art

There are many great resources for learning pandas; this is not one of them. For beginners, I typically recommend Greg Reda’s 3-part introduction, especially if they’re familiar with SQL. Of course, there’s the pandas documentation itself. I gave a talk at PyData Seattle targeted as an introduction if you prefer video form. Wes McKinney’s Python for Data Analysis is still the goto book (and is also a really good introduction to NumPy as well). Jake VanderPlas’s Python Data Science Handbook, in early release, is great too. Kevin Markham has a video series for beginners learning pandas.

dplyr and pandas

Thu, 16 Oct 2014 07:00:00 +0000

This notebook compares pandas and dplyr. The comparison is just on syntax (verbage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We’ll work through the introductory dplyr vignette to analyze some flight data.

I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each python call.

Practical Pandas Part 3 - Exploratory Data Analysis

Tue, 16 Sep 2014 00:00:00 +0000

Welcome back. As a reminder:

In part 1 we got dataset with my cycling data from last year merged and stored in an HDF5 store
In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io.

You can find the full source code and data at this project’s GitHub repo.

Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium.

Practical Pandas Part 2 - More Tidying, More Data, and Merging

Thu, 04 Sep 2014 00:00:00 +0000

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish.

It’s a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps from

data acqusition
data tidying
exploratory analysis
model building
production

As you work through a problem you’ll realize, “I need this other bit of data”, or “this would be easier if I stored the data this way”, or more commonly “strange, that’s not supposed to happen”.

Practical Pandas Part 1 - Reading the Data

Tue, 26 Aug 2014 00:00:00 +0000

This is the first post in a series where I’ll show how I use pandas on real-world datasets.

For this post, we’ll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we’ll see if we can find those.

Using Python to tackle the CPS (Part 4)

Mon, 19 May 2014 12:01:00 +0000

Last time, we got to where we’d like to have started: One file per month, with each month laid out the same.

As a reminder, the CPS interviews households 8 times over the course of 16 months. They’re interviewed for 4 months, take 8 months off, and are interviewed four more times. So if your first interview was in month $m$, you’re also interviewed in months $$m + 1, m + 2, m + 3, m + 12, m + 13, m + 14, m + 15$$.

Using Python to tackle the CPS (Part 3)

Mon, 19 May 2014 12:00:00 +0000

In part 2 of this series, we set the stage to parse the data files themselves.

As a reminder, we have a dictionary that looks like


 id length start end
0 HRHHID 15 1 15
1 HRMONTH 2 16 17
2 HRYEAR4 4 18 21
3 HURESPLI 2 22 23
4 HUFINAL 3 24 26
 ... ... ... ...

giving the columns of the raw CPS data files. This post (or two) will describe the reading of the actual data files, and the somewhat tricky process of matching individuals across the different files. After that we can (finally) get into analyzing the data. The old joke is that statisticians spend 80% of their time munging their data, and 20% of their time complaining about munging their data. So 4 posts about data cleaning seems reasonable.

Tidy Data in Action

Thu, 27 Mar 2014 00:00:00 +0000

Hadley Whickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren’t language specific.

A tidy dataset must satisfy three criteria (page 4 in Whickham’s paper):

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

In this StackOverflow post, the asker had some data NBA games, and wanted to know the number of days since a team last played. Here’s the example data:

Organizing Papers

Thu, 13 Feb 2014 00:00:00 +0000

As a graduate student, you read a lot of journal articles… a lot. With the material in the articles being as difficult as it is, I didn’t want to worry about organizing everything as well. That’s why I wrote this script to help (I may have also been procrastinating from studying for my qualifiers). This was one of my earliest little projects, so I’m not claiming that this is the best way to do anything.

Using Python to tackle the CPS (Part 2)

Tue, 04 Feb 2014 12:00:00 +0000

Last time, we used Python to fetch some data from the Current Population Survey. Today, we’ll work on parsing the files we just downloaded.

We downloaded two types of files last time:

CPS monthly tables: a fixed-width format text file with the actual data
Data Dictionaries: a text file describing the layout of the monthly tables

Our goal is to parse the monthly tables. Here’s the first two lines from the unzipped January 1994 file:

Using Python to tackle the CPS

Mon, 27 Jan 2014 00:00:00 +0000

The Current Population Survey is an important source of data for economists. It’s modern form took shape in the 70’s and unfortunately the data format and distribution shows its age. Some centers like IPUMS have attempted to put a nicer face on accessing the data, but they haven’t done everything yet. In this series I’ll describe methods I used to fetch, parse, and analyze CPS data for my second year paper. Today I’ll describe fetching the data. Everything is available at the paper’s GitHub Repository.

About

Mon, 01 Jan 0001 00:00:00 +0000

Hi, I’m Tom. I’m a programmer living in Des Moines, IA.

Talks

GPU-Accelerated Zarr | video | slides
GPU-Accelerated Cloud-Native Geospatial | video | slides
Pandas: .head() to .tail() | video | materials
Mind the Gap! Bridging the scikit-learn - pandas dtype divide | video | materials
Pandas: .head() to .tail() | video | materials
Scalable Sustainability with the Planetary Computer (PyData Global) | video | materials
Scalable Sustainability Tutorial (Cloud Native Geospatial Day 2022) | video | materials
Scalable Geospatial Analysis | video
Planetary Computer overview at NASA’s Data and Computing Architecture Study (with Bruno) | video

Podcasts

Microsoft Planetary Computer on Talk Python
Pandas Extension Arrays on Podcast.__init__.

Writing

Effective Pandas: A series on writing effective, idiomatic pandas.
A few posts on Medium with various co-authors.
This Blog

Contact

Either on Mastodon @[email protected] or by email at [email protected]">[email protected].