Make edgeStorageLevel and vertexStorageLevel configurable #225

estebandonato · 2017-08-02T22:47:44Z

This PR addresses issue #219. The properties intermediateVertexStorageLevel and intermediateEdgeStorageLevel, with their respective getters and setters, were added to the classes that represent graph algorithms with a GraphX implementation. These two properties are used to change the storage level of the GraphX instance referenced by GraphFrame.cachedTopologyGraphX before calling the GraphX algorithm implementation.

In the special case of the ConnectedComponents class, the properties were named graphxVertexStorageLevel and graphxEdgeStorageLevel to differentiate them from the property intermediateStorageLevel used in the GraphFrame implementation of this algorithm. Also, the Python API of ConnectedComponents was modified to include a similar feature that #213 implemented only for the Scala API.

These changes were applied to both the Scala and Python APIs.
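
For readers who have not opened the diff, the getter/setter pattern described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the trait name `GraphXStorageLevels`, the stand-in class `PageRankSketch`, and the exact setter and getter names are assumptions made for illustration, and the `MEMORY_ONLY` default is assumed because it is the GraphX default.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical mix-in: the trait and method names are illustrative only
// and are not taken from the PR's diff.
trait GraphXStorageLevels {
  // Assumed defaults, chosen to mirror the GraphX default of MEMORY_ONLY.
  private var vertexLevel: StorageLevel = StorageLevel.MEMORY_ONLY
  private var edgeLevel: StorageLevel = StorageLevel.MEMORY_ONLY

  // Setters return this.type so that calls can be chained builder-style.
  def setIntermediateVertexStorageLevel(value: StorageLevel): this.type = {
    vertexLevel = value
    this
  }

  def getIntermediateVertexStorageLevel: StorageLevel = vertexLevel

  def setIntermediateEdgeStorageLevel(value: StorageLevel): this.type = {
    edgeLevel = value
    this
  }

  def getIntermediateEdgeStorageLevel: StorageLevel = edgeLevel
}

// Hypothetical stand-in for an algorithm class that wraps a GraphX implementation.
class PageRankSketch extends GraphXStorageLevels
```

With such a mix-in, a caller would write something like `new PageRankSketch().setIntermediateVertexStorageLevel(StorageLevel.MEMORY_AND_DISK)` before running the algorithm.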

codecov-io · 2017-08-04T08:50:12Z

Codecov Report

Merging #225 into master will increase coverage by 0.72%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
+ Coverage   89.18%   89.91%   +0.72%     
==========================================
  Files          20       20              
  Lines         740      793      +53     
  Branches       40       44       +4     
==========================================
+ Hits          660      713      +53     
  Misses         80       80

Impacted Files	Coverage Δ
...n/scala/org/graphframes/lib/LabelPropagation.scala	`100% <100%> (ø)`	⬆️
...cala/org/graphframes/lib/ConnectedComponents.scala	`95.37% <100%> (+0.42%)`	⬆️
.../graphframes/lib/StronglyConnectedComponents.scala	`100% <100%> (ø)`	⬆️
src/main/scala/org/graphframes/GraphFrame.scala	`86.95% <100%> (+0.38%)`	⬆️
...graphframes/lib/ParallelPersonalizedPageRank.scala	`100% <100%> (ø)`	⬆️
...main/scala/org/graphframes/lib/ShortestPaths.scala	`96.66% <100%> (+1.01%)`	⬆️
src/main/scala/org/graphframes/lib/PageRank.scala	`96.55% <100%> (+1.31%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7948381...903b948.

estebandonato · 2017-08-04T10:41:43Z

oraclejdk7 is no longer available in Trusty, so I modified the Travis configuration to use openjdk7. More details are in travis-ci/travis-ci#7964.

I'm done with this PR and am waiting for your feedback.

felixcheung · 2017-09-23T08:19:04Z

Instead of having these in each GF method, I wonder if this should be a property of the "Graph" or "GraphFrame", for the "Edges" and "Vertices"?

@thunterdb @mengxr

felixcheung · 2017-09-23T08:19:26Z

.travis.yml

@@ -1,5 +1,5 @@
 jdk:
-  - oraclejdk7 # openJDK crashes sometimes
+  - openjdk7 # openJDK crashes sometimes


can you rebase to pick up the latest?

estebandonato · 2017-09-26T22:34:03Z

@felixcheung actually, I considered implementing this feature the way you describe. However, I followed the current approach mainly to stay aligned with the API changes implemented in #213 and to allow different storage level settings for different algorithm executions. On the other hand, if we implement the vertex and edge storage levels as GraphFrame properties, we would be aligned with the approach used in GraphX. Both options have their pros and cons. Just let me know what you think, and I can change the code if we decide to go with the properties approach.

felixcheung · 2017-09-28T05:59:15Z

Right, I'm just worried about the impact of having to add 2 parameters to every single graph algorithm we have.
Let's open this up and see whether anyone has a preference on the approach.

mengxr · 2017-09-28T15:58:14Z

@estebandonato Two comments:

1. Why do you need to control the storage levels for vertices/edges in GraphX? I'm curious about the scenarios.
2. To prevent adding too many parameters to the APIs, we can introduce a new class and predefined constants:

class GraphStorageLevel(vertexStorageLevel, edgeStorageLevel)

object GraphStorageLevel {
  val MEMORY_ONLY = ...
  val ...
}

Then in each method we only need one arg to describe the storage levels. We can overload the current method to take this type. But this is only needed if 1) can be justified.
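
One way the proposed class could be fleshed out is sketched below. This is only an illustration under stated assumptions: the case-class form, the `uniform` helper, and the choice of `MEMORY_AND_DISK` as a second predefined constant are not part of the proposal above, which leaves the constants elided. Only `org.apache.spark.storage.StorageLevel` is an existing Spark type.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical: bundles both storage levels so that an algorithm method
// needs only one extra argument.
final case class GraphStorageLevel(
    vertexStorageLevel: StorageLevel,
    edgeStorageLevel: StorageLevel)

object GraphStorageLevel {
  // Assumed helper (not in the proposal): use one level for vertices and edges.
  def uniform(level: StorageLevel): GraphStorageLevel =
    GraphStorageLevel(level, level)

  // Predefined constants; which ones to offer is an assumption of this sketch.
  val MEMORY_ONLY: GraphStorageLevel = uniform(StorageLevel.MEMORY_ONLY)
  val MEMORY_AND_DISK: GraphStorageLevel = uniform(StorageLevel.MEMORY_AND_DISK)
}
```

An overloaded algorithm method could then accept a single `GraphStorageLevel` argument, and callers who need different levels for vertices and edges could still construct `GraphStorageLevel(vertexLevel, edgeLevel)` directly.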

estebandonato · 2017-10-02T16:42:12Z

@mengxr regarding your question 1): we need to control storage levels in GraphX mainly because most of GraphFrame's graph algorithms are just wrappers around the GraphX implementations. This is the case for PageRank, LabelPropagation, and StronglyConnectedComponents, to name a few. These are all iterative algorithms that extensively cache vertices and edges on each iteration to avoid recomputation. Concretely, that is what the Pregel implementation in GraphX does: https://github.com/apache/spark/blob/v2.1.1/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala#L112. The storage levels used when caching a GraphX instance are defined in the Graph.apply method, as shown in the Spark code below:

def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]

Currently, when GraphFrame has to call a GraphX algorithm, it first converts the GraphFrame instance to a GraphX instance, but there is no way to specify the vertex and edge storage levels when creating that GraphX instance. As a consequence, the default MEMORY_ONLY storage level is used to cache the graph on each iteration. In some environments with limited memory, partitions that do not fit in memory are not cached, so they are recomputed, and the cost of recomputing them gets worse on each new iteration. Specifically, what we have experienced, for instance with PageRank, is that performance degraded on each iteration. We have also seen that the amount of shuffle reads for the same tasks increased on each iteration.

As a workaround, we had to copy the wrapper code implemented in GraphFrame into our own code, changing the way a GraphFrame instance is converted to a GraphX instance so that we can set the storage levels (we set them to MEMORY_AND_DISK). With that workaround, the response times and shuffle reads stopped degrading on each iteration. I hope this justifies the change. Eventually, if there are plans to migrate the GraphX implementations to custom GraphFrame implementations (as was done for connected components), we could reuse these parameters for the new implementations.
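
The core of that workaround, stripped of the GraphFrame-specific conversion code, might look like the sketch below. It is not the code we actually run: the object and method names (`StorageLevelSketch`, `buildGraph`, `pageRankWithSpill`) are hypothetical, and it starts from plain vertex and edge RDDs. It relies only on the `Graph.apply` signature quoted above and on GraphX's `staticPageRank`.

```scala
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object; all names here are illustrative only.
object StorageLevelSketch {

  // Build a GraphX graph whose vertices and edges are cached with `level`
  // instead of the MEMORY_ONLY default of Graph.apply.
  def buildGraph[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      level: StorageLevel = StorageLevel.MEMORY_AND_DISK): Graph[VD, ED] = {
    Graph(
      vertices,
      edges,
      defaultVertexAttr = null.asInstanceOf[VD],
      edgeStorageLevel = level,
      vertexStorageLevel = level)
  }

  // Run a fixed number of PageRank iterations on a graph built as above,
  // so that partitions that do not fit in memory spill to disk rather
  // than being recomputed.
  def pageRankWithSpill[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      numIter: Int): Graph[Double, Double] = {
    buildGraph(vertices, edges).staticPageRank(numIter)
  }
}
```

Running this requires Spark with the GraphX module on the classpath and an active SparkContext to create the input RDDs.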

Regarding 2): if we want to avoid adding too many parameters to the API, another alternative is to have just one storage level value per algorithm and apply it to both vertices and edges.

Let me know your feedback to make the changes accordingly.

estebandonato · 2017-10-17T18:21:47Z

@mengxr did you have a chance to read the reasons for this change explained above? Please let me know your thoughts.

estebandonato · 2018-07-03T13:58:36Z

@mengxr any update on this?

estebandonato added 11 commits July 24, 2017 19:52

6915dd5 make intermediate storage level for vertices and edges configurable in label propagation - Scala api
fd97375 make intermediate storage level for vertices and edges configurable in label propagation - Python api
fa48674 make intermediate storage level for vertices and edges configurable in pagerank - Scala api
1bc33c7 code optimization in pageRank test suite
8c29e92 make intermediate storage level for vertices and edges configurable in parallel personalized pagerank - Scala api
7608c7b make intermediate storage level for vertices and edges configurable in shortest path - Scala api
62096ec make intermediate storage level for vertices and edges configurable in scc - Scala api
ca68ab8 make intermediate storage level for vertices and edges configurable in cc - Scala api
71ae2e2 make intermediate storage level for vertices and edges configurable for all algorithms in python
4155549 move from oraclejdk7 to openjdk7
3e0979e disabling ParallelPersonalizedPage tests for spark version < 2.1

estebandonato mentioned this pull request Aug 4, 2017

Make edgeStorageLevel and vertexStorageLevel configurable #219

Open

felixcheung reviewed Sep 23, 2017

estebandonato added 2 commits September 26, 2017 15:30

80c8b21 Merge branch 'master' into set-edgeStorageLevel-vertexStorageLevel
903b948 disabling ParallelPersonalizedPage tests for spark v2.2
