[178] AggregateMessages: multiple message and aggregation columns #186
Conversation
…thods as well as multiple aggregation functions
Could you rebase this please?
Sure, will work on that. Will check both and let you know.
Code rebased, and the test failures caused by incompatibility with Spark v1.6.3 are fixed. What is failing now are the Python tests, because of the aggregateMessages API recently added in Python. Working on that.
…in message as well as multiple aggregation functions
python/graphframes/graphframe.py
Outdated
@@ -239,19 +240,27 @@ def aggregateMessages(self, aggCol, sendToSrc=None, sendToDst=None):
        if sendToSrc is not None:
            if isinstance(sendToSrc, Column):
                builder.sendToSrc(sendToSrc._jc)
            elif isinstance(sendToSrc, list):
                for col in sendToSrc:
                    builder.sendToSrc(col._jc)
should probably check that each of these cols is a Column?
absolutely, this is still wip
case columns => df.select(
  df(ID),
  struct(columns.sorted.map(c => col(s"`${c}`").as(removeStructName(c))) :_*)
    .as(AggregateMessages.MSG_COL_NAME))
can you explain and comment why this and associated helper functions are necessary?
This is the way AggregateMessages works: it starts with a DataFrame resulting from calling GraphFrame.triplets, which generates a DF with columns "src", "dst", and "edge", each of which is of struct type with one struct element per original attribute.
When only one attribute per AggregateMessages call was supported, that attribute was used in the projection of a query over this DF. So let's say I have the following graph:
GraphFrame(v:[id: string, name: string, age: int], e:[src: string, dst: string, relationship: string])
If I wanted to send vertex attribute "name" from src to dst, that would generate the following query:
select dst['id'] AS id, src['name'] AS MSG from df.triplets
in the inverse direction
select src['id'] AS id, dst['name'] AS MSG from df.triplets
and in both directions:
select dst['id'] AS id, src['name'] AS MSG from df.triplets UNION select src['id'] AS id, dst['name'] AS MSG from df.triplets
Now, with the changes I introduced, multiple attributes can be sent in a message. The way it is resolved is to group all the attributes into a struct type. So, following the same example, if I wanted to send both attributes "name" and "age" from src to dst, the new query would be:
select dst['id'] AS id, struct(src['name'], src['age']) AS MSG from df.triplets
and similar for the other directions.
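The projection logic described above can be sketched as plain string manipulation. This is a hypothetical helper (build_msg_query is not part of the PR), shown only to make the query shapes above concrete:

```python
def build_msg_query(direction, attrs):
    """Derive the triplets projection for a message.

    direction: "toDst" sends src attributes to dst vertices; "toSrc" the reverse.
    attrs: the vertex attributes to include in the message.
    """
    agg_side, msg_side = ("dst", "src") if direction == "toDst" else ("src", "dst")
    if len(attrs) == 1:
        # Single attribute: project it directly as MSG
        msg = "{0}['{1}'] AS MSG".format(msg_side, attrs[0])
    else:
        # Multiple attributes: group them into a struct, as the PR does
        cols = ", ".join("{0}['{1}']".format(msg_side, a) for a in attrs)
        msg = "struct({0}) AS MSG".format(cols)
    return "select {0}['id'] AS id, {1} from df.triplets".format(agg_side, msg)
```

Running it on the example graph reproduces the queries shown above, e.g. `build_msg_query("toDst", ["name", "age"])` yields the struct-based query.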
So this solution brings some challenges:
When I try to create a struct from existing struct elements, the newly generated DF names each element col1, col2... colN rather than keeping the original names. With the example above, try the following:
g.triplets.select(struct(expr("src['id']")).as("MSG")).printSchema
g.triplets.select(struct(col("src")("id")).as("MSG")).printSchema
and you will see what I'm saying. So what line 133 basically does is create the struct while maintaining the original attribute names.
Additionally, what removeStructName and the other helper functions do is the following: if I kept the original names as I said above, then I would end up with the following struct:
struct("src.name", "src.age") as MSG
that represents the message sent from src to dst. However, the message in the opposite direction would be:
struct("dst.name", "dst.age") as MSG
Since both structs have different element names, the union between both DFs doesn't work. So the helper method removes the struct name from the columns, so that both directions end up with the following struct:
struct("name", "age") as MSG
This way the union is possible.
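In rough Python terms (a hypothetical mirror of the PR's Scala removeStructName helper, not the actual code), the renaming that makes the union legal looks like this:

```python
def remove_struct_name(qualified):
    # "src.name" -> "name": drop the struct prefix from a qualified column name
    return qualified.split(".", 1)[-1]

# Field names of the two message structs before normalization differ...
src_msg_fields = [remove_struct_name(c) for c in ["src.name", "src.age"]]
dst_msg_fields = [remove_struct_name(c) for c in ["dst.name", "dst.age"]]
# ...but after stripping the struct name, both directions expose the same
# field names, so the union of the two DataFrames is possible.
assert src_msg_fields == dst_msg_fields == ["name", "age"]
```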
Last but not least, I found that while in Spark 1.6 the default column name set in a DF when you select a struct element is struct_name[struct_element], i.e. src[name], in Spark 2.x it is struct_name.struct_element, i.e. src.name. That's why you will see the functions removeStructNameForSpark2 and removeStructNameForSpark16.
Hope this clarifies your doubts. Please let me know if you have further questions.
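Based on that description, the two version-specific helpers might look roughly like this in Python (behavior inferred from the comment above, not taken from the PR's Scala code):

```python
import re

def remove_struct_name_spark16(name):
    # Spark 1.6 names a selected struct element "src[name]" -> extract "name"
    m = re.fullmatch(r"\w+\[(.+)\]", name)
    return m.group(1) if m else name

def remove_struct_name_spark2(name):
    # Spark 2.x names it "src.name" -> extract "name"
    return name.split(".", 1)[1] if "." in name else name
```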
One check failed due to an OutOfMemory error while running the test_connected_components_friends (graphframes.tests.GraphFrameLibTest) unit test, which has nothing to do with the changes in this PR. Has this ever happened before? Do we have to increase the heap assigned to the jobs in Travis?
hmm, not sure. If you look at PR #195, which is a recent PR related to connected components, tests seem to be passing.
Codecov Report
@@ Coverage Diff @@
## master #186 +/- ##
=======================================
Coverage 89.18% 89.18%
=======================================
Files 20 20
Lines 740 740
Branches 40 39 -1
=======================================
Hits 660 660
Misses 80 80
@felixcheung I think PR #195 is about fixing the non-deterministic ID assignment when a GraphFrame is converted to GraphX and the vertexId is not convertible to a Long type. So even though the bug was detected in the connected components module, it affects every module/class or piece of code that calls the GraphFrame.toGraphX method, i.e. connected components, lpa, pagerank, etc. Actually the fix doesn't involve changes in the connected components class. I know because I've been following that PR very closely, since I experienced the same bug using lpa and a proprietary algorithm implemented with Pregel. Anyway, I just pushed some minor changes (docs, validations, etc.) and now all checks have passed. So this is a first version of this new feature. I'm planning to keep working on resolving the first limitation I listed above: "if both sendToSrc & sendToDst methods are called, then you have to call both methods with the same columns". In the meantime, if you can review this change and provide feedback, I would really appreciate it.
@estebandonato thanks for adding the details! I'm hoping to get back and review in a week or two. Sorry about the delay.
@felixcheung did you have a chance to review this PR? In the meantime, since Spark 1.6 is no longer supported, I will remove some helper functions to simplify the code.
Sorry for dropping this - please update - I should have some time this week to review. Thanks!
Would you like to follow up?
Yes, I'll be finishing this PR this week, removing support for Spark 1.6. Besides this change, the feature is complete. Sorry for the delay.
@felixcheung finally I'm done with this feature. In the last commit I removed support for Spark 1.6 and also performed some code refactoring which makes the code much simpler to follow. I also changed the API: now if you want to send multiple columns in a message, rather than having to call sendToSrc or sendToDst multiple times, you call it once, since both methods accept varargs. Please review the solution and let me know your feedback.
Ah, thanks, I really like this newer approach; it seems to make for a much cleaner solution. I'll review this in more detail shortly.
@felixcheung done! I updated the #186 description with the latest. Let me know your feedback when you review this.
Really need this feature! What's the status now?
@SidWeng this feature is pending review by @felixcheung
@estebandonato hi, any updates on this feature?
@tchow-notion no updates unfortunately. Apparently it was not prioritized for review and now the PR is stale. Unless somebody has a different opinion, we should cancel this.
This PR addresses issue #178. There are 2 changes in the AggregateMessages class:
1. The sendToSrc & sendToDst methods now accept varargs of org.apache.spark.sql.Column or String instances, and when the message is created all these columns are grouped into a struct type. If both sendToSrc & sendToDst methods are called with multiple columns, both structs (number of columns, column names and types) have to be equal.
2. Multiple aggregation functions are executed over the DataFrame that represents the messages. The resulting DataFrame contains the vertexId plus one column per aggregation function passed to the agg method, with the result of each aggregation.
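The equality constraint between the two structs could be enforced by a check along these lines. This is a hypothetical sketch of the validation, not the PR's actual code:

```python
def check_message_structs(src_fields, dst_fields):
    """src_fields / dst_fields: lists of (column name, data type) tuples
    describing the structs built by sendToSrc and sendToDst respectively."""
    if len(src_fields) != len(dst_fields):
        raise ValueError("sendToSrc and sendToDst must use the same number of columns")
    for (s_name, s_type), (d_name, d_type) in zip(src_fields, dst_fields):
        if (s_name, s_type) != (d_name, d_type):
            raise ValueError("mismatched message column: {0}:{1} vs {2}:{3}".format(
                s_name, s_type, d_name, d_type))

# Matching structs pass silently; a mismatch in name or type raises ValueError.
check_message_structs([("name", "string"), ("age", "int")],
                      [("name", "string"), ("age", "int")])
```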