DataFrame Processing with FunctionNode and PipeNode¶
Introduction¶
The FunctionNode and PipeNode were built in large part to handle data-processing pipelines with Pandas Series and DataFrame objects. The following examples do simple things with data, but provide a framework that can be expanded to meet a wide range of needs.
Tutorial Data Source¶
Following an example in Wes McKinney’s Python for Data Analysis, 2nd Edition (2017), these examples will use U.S. child birth name records from the Social Security Administration. Presently, this data is found at the following URL. We will write Python code to automatically download this data.
DataFrame Processing with FunctionNode¶
FunctionNode-wrapped functions can be used to link functions in linear compositions. What is passed between the nodes can change, as long as each node is prepared to receive the value of its predecessor. As before, core callables are called only after the complete composition expression is evaluated to a single function and called with the initial input.
We will use the following imports throughout these examples. The requests and pandas third-party packages can be installed using pip.
We will introduce the FunctionNode-decorated functions one at a time. We start with a function that, given a destination file path, will download the dataset (if it does not already exist), read the zip archive, and load the data into an OrderedDict of DataFrame objects keyed by year. Each DataFrame has columns for “name”, “gender”, and “count”. We will for now store the URL as a module-level constant.
Next, we have a function that, given that same dictionary, produces a single DataFrame that lists, for each year, the total number of males and females recorded, with columns for “M” and “F”. Notice that the approach used below strictly requires an OrderedDict.
Given row data that represent parts of a whole, a utility function can be used to convert the previously created DataFrame into percent floats.
A utility function can be used to select a contiguous year range from a DataFrame indexed by integer year values. We expect the start and end parameters to be provided through partialing, and the DataFrame to be provided from the predecessor return value:
We can plot any DataFrame using Pandas’ interface to matplotlib (which will need to be installed and configured separately). The function takes an optional argument for a destination file path and returns the same path after writing an image file.
Finally, to open the resulting plot for viewing, we will use Python’s webbrowser module.
With all functions decorated as FunctionNode, we can create a composition expression. The partialed start and end arguments permit selecting different year ranges. Notice that the data passed between nodes changes, from an OrderedDict of DataFrame, to a DataFrame, to a file path string. To call the composition expression f, we simply pass the necessary argument of the innermost load_data_dict function.
If, for the sake of display, we want to convert the floating-point percents to integers before plotting, we do not need to modify the FunctionNode implementation. As FunctionNode supports operators, we can simply scale the output of the percent FunctionNode by 100.
While this approach is illustrative, it is limited. Using simple linear composition, as above, it is not possible with the same set of functions to produce multiple plots from the same data, or to both write plots and output DataFrame data to Excel. This and more is possible with PipeNode.
DataFrame Processing with PipeNode¶
Building on the tutorial from earlier (LINK NEEDED), we will now explore processing DataFrames using PipeNode.
While not required to use pipelines, it is useful to create a PipeNodeInput subclass that will share state across the pipeline.
The following implementation of a PipeNodeInput subclass stores the URL as the class attribute URL_NAMES, and stores the output_dir argument as an instance attribute. The load_data_dict function is essentially the same as before, though here it is a classmethod that reads URL_NAMES from the class. The resulting data_dict instance attribute is stored in the PipeNodeInput, making it available to every node.
We can generalize the gender_count_per_year function from above to count names per gender per year. Names often have variants, so we can match names with a passed-in function name_match. As this node takes an expression-level argument, we decorate it with pipe_node_factory. Setting this function to lambda n: True results in exactly the same functionality as the gender_count_per_year function. Recall that we can access data_dict from the positionally bound pni argument.
A number of functions used above as FunctionNode can be recast as PipeNode by simply binding fpn.PREDECESSOR_RETURN as the first positional argument. Recall that PipeNodes that need expression-level arguments are decorated with pipe_node_factory. The plot node now takes a file_name argument, to be combined with the output directory set in the PipeNodeInput instance.
With these nodes defined, we can create many different processing pipelines. For example, to plot two graphs, one each for the distribution of names that start with “lesl” and “dana”, we can create the following expression. Notice that, for maximum efficiency, load_data_dict is called only once, in the PipeNodeInput. Further, now that plot takes a file name argument, we can uniquely name our plots.
To support graphing the gender distribution for multiple names simultaneously, we can create a specialized node to merge PipeNode expressions passed as keyword arguments. We will then merge all of those DataFrame key-value pairs.
Now we can create two expressions, one for each name we are investigating. These are then passed to merge_gender_data as keyword arguments. In all cases the raw data DataFrame is now retained with the store PipeNode. After plotting and viewing, we can retrieve and iterate over stored keys and DataFrames by accessing the store_items property of PipeNodeInput. In this example, we load each DataFrame into a sheet of an Excel workbook.
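A sketch of the Excel-writing step; it assumes store_items yields (key, DataFrame) pairs as the text describes, and that an Excel engine such as openpyxl is installed for pd.ExcelWriter:

```python
import pandas as pd

def store_to_xlsx(pni, fp):
    '''Write each (key, DataFrame) pair from the PipeNodeInput's store_items
    property to its own sheet of one Excel workbook; return fp.'''
    with pd.ExcelWriter(fp) as writer:
        for key, df in pni.store_items:
            # Excel limits sheet names to 31 characters
            df.to_excel(writer, sheet_name=str(key)[:31])
    return fp
```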
These examples demonstrate organizing data-processing routines with PipeNode expressions. Using PipeNodeInput subclasses, data-access routines can be centralized and made as efficient as possible. Further, PipeNodeInput subclasses can provide common parameters, such as output directories, to all nodes. Finally, the results of sub-expressions can be stored and recalled within PipeNode expressions, or extracted after PipeNode execution for writing to disk.
Conclusion¶
After going through this tutorial, you should now have an understanding of:
- How to use fpn.FunctionNode to do DataFrame processing
- How to use fpn.PipeNode to do DataFrame processing
Here are all of the code examples we have seen so far: