
Execution engine - Handle map and reduce #30

Open
anibalsolon opened this issue Jan 16, 2020 · 4 comments

anibalsolon commented Jan 16, 2020

To support map/reduce operations, I propose two operators: @ for mapping, and % for reducing.

Each use of @ spawns a new dimension for the subsequent workflow (assuming a flat workflow is zero-dimensional). The result of using this operator is equivalent to the outer product of the dimensions, with each entry being a job parametrization to execute.

job_b.field @= job_a.list_field

It reads as "map each entry from job_a.list_field to job_b.field".
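
A rough plain-Python sketch of that expansion, just to illustrate the outer-product behaviour (the helper name below is made up for this illustration, not part of the engine):

from itertools import product

# Illustration only: each @ mapping contributes one dimension, and the job
# runs once per entry of the outer product of all mapped dimensions.
def job_parametrizations(mapped):
    # mapped: field name -> list of values, one entry per @ mapping
    names = list(mapped)
    return [dict(zip(names, values)) for values in product(*mapped.values())]

job_parametrizations({'a': [1, 2], 'b': ['x', 'y', 'z']})
# 2 x 3 = 6 parametrizations: [{'a': 1, 'b': 'x'}, {'a': 1, 'b': 'y'}, ...]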

The % operator collapses dimensions, transforming their entries back into lists in the original ordering. It is a binary operation, in which the first argument is the entries to transform into a list, and the second argument is the dimensions to collapse.

job_c.field %= job_c.out_field @ job_a.list_field

It reads as "reduce the entries from job_c.out_field over the dimension job_a.list_field to job_c.field".
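
For reference, both operators fit standard Python operator overloading; a minimal sketch of the plumbing (the class and attribute names below are made up for illustration, not the engine's actual API):

# job['field'] @= source              ->  __getitem__, then __imatmul__, then __setitem__
# job['field'] %= entries @ dimension ->  __matmul__ pairs the entries with the
#                                         dimension, __imod__ records the reduction
class Field:
    def __init__(self, job, name):
        self.job, self.name = job, name
        self.mapped_over = None    # dimension spawned by @
        self.reduced_over = None   # dimension collapsed by %
        self.reduce_source = None  # field whose entries are gathered by %

    def __matmul__(self, dimension):     # field @ dimension
        self.mapped_over = dimension
        return self

    __imatmul__ = __matmul__             # so field @= dimension works the same way

    def __imod__(self, entries):         # field %= entries @ dimension
        self.reduce_source = entries
        self.reduced_over = entries.mapped_over
        return self

class Job:
    def __init__(self):
        self._fields = {}

    def __getitem__(self, name):         # job['field'] yields a Field to augment
        return self._fields.setdefault(name, Field(self, name))

    def __setitem__(self, name, field):  # the augmented Field is stored back
        self._fields[name] = field

    def __getattr__(self, name):         # job.some_output also yields a Field
        return self[name]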

A full example:

rp = ResourcePool()

pieces = lambda path: {'pieces': path.split('/'), 'indexes': range(len(path.split('/')))}
uppercase = lambda text: {'text': text.upper()}
indexed_uppercase = lambda text, index: {'text': f'{index}-{text.upper()}'}
join = lambda pieces: {'text': '/'.join(pieces)}

job_pieces = PythonJob(function=pieces, reference='pieces_job')
job_pieces['path'] = Resource('usr/lib/libgimp.so')

job_uppercase = PythonJob(function=uppercase, reference='uppercase_job')
job_uppercase['text'] @= job_pieces.pieces

job_join = PythonJob(function=join, reference='join_job')
job_join['pieces'] %= job_uppercase.text @ job_pieces.pieces

rp[R('text')] = job_join.text

rp = DependencySolver(rp).execute(executor=Execution())
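
For intuition, here is what the example computes, written as plain Python (this is only the equivalent computation, not how the engine would execute it):

# Plain-Python equivalent of the workflow above, for intuition only.
path = 'usr/lib/libgimp.so'
pieces = path.split('/')               # ['usr', 'lib', 'libgimp.so']
upper = [p.upper() for p in pieces]    # one uppercase_job node per entry of the dimension
text = '/'.join(upper)                 # the dimension reduced back into a single value
# text == 'USR/LIB/LIBGIMP.SO'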

To ease the mnemonics, one can name the dimensions, so when using the reduce operator, one can simply use the dimension name instead of having to refer to the original job:

job_uppercase['text'] @= job_pieces.pieces, 'path_pieces'

job_join['pieces'] %= job_uppercase.text @ 'path_pieces'

There are situations in which one might want to link several fields within the same dimension. By providing a tuple of fields, each one is mapped to the corresponding field in the selector:

job_uppercase = PythonJob(function=indexed_uppercase, reference='uppercase_job')
job_uppercase[['text', 'index']] @= (job_pieces.pieces, job_pieces.indexes), 'path_pieces'

# All these reducing operators execute the same operation
job_join['pieces'] %= job_uppercase.text @ 'path_pieces'
job_join['pieces'] %= job_uppercase.text @ (job_pieces.pieces, job_pieces.indexes)
job_join['pieces'] %= job_uppercase.text @ (job_pieces.pieces)
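
In plain Python terms, linking fields in the same dimension behaves like a zip rather than an outer product (a sketch, not engine code):

# Linked fields share one dimension: the entries are paired up (zip),
# not combined as an outer product.
pieces = ['usr', 'lib', 'libgimp.so']
indexes = range(len(pieces))
texts = [f'{index}-{piece.upper()}' for piece, index in zip(pieces, indexes)]
# ['0-USR', '1-LIB', '2-LIBGIMP.SO']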

anibalsolon commented Mar 19, 2020

@ccraddock @puorc Please, whenever you have time, take a look and let's discuss! And let me know if you need more details.


puorc commented Mar 20, 2020

Hi Anibal, the map operator is nice! But I feel a little confused about 'reduce the entries from job_c.out_field over the dimension job_a.list_field to job_c.field'. Could you elaborate a bit more on 'dimension'? Does it represent one attribute of the return dict?

anibalsolon commented

Yes, so every time we map a list of N items to N nodes, we create a dimension for the node: from a dot (a simple workflow, in a zero-dimensional space), we go to a line with N points.

[figure: a single node, a zero-dimensional workflow]

Mapping a list:

[figure: one list mapped to N nodes along one dimension]

Mapping two lists, so the input of each node is the combination of each pair of items in both lists:

[figure: two lists mapped to a grid of nodes, one per pair of items]

And as we reduce each of the dimensions, we gather the results of the nodes back into lists, in their original order. It is still a reducing operation, but instead of summarizing the results (like an average would), we are building lists.

[figure: a dimension reduced back into a list of node results]

If we reduced/squashed both dimensions at the same time and fed that into a node, the value would be a list of lists (i.e. a matrix) of the results.
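
A plain-Python sketch of the shapes involved:

# Shapes when reducing dimensions, sketched in plain Python.
xs, ys = [1, 2, 3], ['a', 'b']
f = lambda x, y: f'{x}{y}'

# Mapping both lists spawns one node per (x, y) pair (outer product).
results = {(x, y): f(x, y) for x in xs for y in ys}

# Reducing only the ys dimension: the xs dimension survives, so a downstream
# node still runs once per x, and each run receives a list of 2 results.
per_x_inputs = {x: [results[(x, y)] for y in ys] for x in xs}

# Reducing both dimensions at once: one downstream node receives a
# list of lists (a matrix) of all the results.
matrix_input = [[results[(x, y)] for y in ys] for x in xs]
# [['1a', '1b'], ['2a', '2b'], ['3a', '3b']]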


puorc commented Apr 2, 2020

Thank you for the illustration. It's nice to work on.
