-
Notifications
You must be signed in to change notification settings - Fork 250
Make edgeStorageLevel and vertexStorageLevel configurable #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Make edgeStorageLevel and vertexStorageLevel configurable #225
Conversation
…n label propagation - Scala api
…n label propagation - Python api
…n pagerank - Scala api
…n parallel personalized pagerank - Scala api
…n shortest path - Scala api
…n scc - Scala api
…or all algorithms in python
Codecov Report
@@ Coverage Diff @@
## master #225 +/- ##
==========================================
+ Coverage 89.18% 89.91% +0.72%
==========================================
Files 20 20
Lines 740 793 +53
Branches 40 44 +4
==========================================
+ Hits 660 713 +53
Misses 80 80
Continue to review full report at Codecov.
|
oraclejdk7 is no longer available in Trusty, so I modified travis configuration to use openjdk7. More details travis-ci/travis-ci#7964 I'm done with this PR and waiting for your feedback |
instead of having these in each GF methods, I wonder if this should be a property of the "Graph" or "GraphFrame", for the "Edges" and "Vertices"? |
.travis.yml
Outdated
@@ -1,5 +1,5 @@ | |||
jdk: | |||
- oraclejdk7 # openJDK crashes sometimes | |||
- openjdk7 # openJDK crashes sometimes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rebase to pick up the latest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@felixcheung actually I thought implementing this feature the way you describe, however I followed the current approach mainly to be aligned with the changes in api implemented in #213 and to allow different storage level settings for different algorithm executions. On the other hand, if we implement vertex and edge storage levels as GraphFrame properties we would be aligned with the approach used in GraphX. Both options have their pros and cons. Just let me know what you guys think and I can change the code if we decide to go with the properties approach. |
right, I'm just worry the impact of having to have 2 parameters for every single one graph algo we have. |
@estebandonato Two comments:
class GraphStorageLevel(vertexStorageLevel, edgeStorageLevel)
object GraphStorageLevel {
val MEMORY_ONLY = ...
val ...
} Then in each method we only need one arg to describe the storage levels. We can overload the current method to take this type. But this is only needed if 1) can be justified. |
@mengxr regarding your question 1) we need to control storage levels in GraphX mainly because most of the GraphFrame's graph algorithms are just wrappers of the GraphX implementations. These are the cases of PageRank, LabelPropagation, StronglyConnectedComponents, just to list some of them. These are all iterative algorithms which extensively cache vertices and edges on each iteration to avoid re-computation. Concretely, that's the case of Pregel implementation in GraphX https://github.com/apache/spark/blob/v2.1.1/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala#L112. Storage levels used when caching a GraphX instance are defined in the Graph.apply method as it is shown in the Spark code below
Currently, when GraphFrame has to call a GraphX algorithm, it first converts a GraphFrame instance to a GraphX instance but there is no way to specify vertex & edge storage levels when creating the GraphX instance. As consequence the default Regarding 2) if we want to prevent adding too many parameters to the API, another alternative is just to have 1 storage level value per algorithm and apply it to both vertices and edges Let me know your feedback to make the changes accordingly. |
@mengxr did you have the chance to read the reasons of this change explained above? Please let me know your thoughts |
@mengxr any update on this? |
This PR addresses issue #219 The properties
intermediateVertexStorageLevel
andintermediateEdgeStorageLevel
with their respective getters and setters were added to the classes that represent graph algorithms with a GraphX implementation. These 2 properties are used to change the storage level of the GraphX instance referenced byGraphFrame.cachedTopologyGraphX
before calling the GraphX algorithm implementation.In the special case of ConnectedComponent class, the properties were named
graphxVertexStorageLevel
andgraphxEdgeStorageLevel
to differentiate them with the propertyintermediateStorageLevel
used in the graphframe implementation of this algorithm. Also the Python api of ConnectedComponent was modified to include a similar feature implemented in #213 only to the Scala apiThese changes were applied to both the Scala and Python apis