Conversation

@kronbichler
Member

@kronbichler kronbichler commented Dec 21, 2021

While looking at #119 I found out that the evaluation of the pressure difference takes an extraordinary amount of time: even in parallel, it could take 30-50% of the overall run time, as it is called every time step. This PR replaces the function with deal.II's VectorTools::point_values, which is based on the recently added fast evaluation of tensor product elements (FEPointEvaluation) and the remote evaluation capabilities of RemotePointEvaluation. The idea of the new code is that rank 0 (which will write the output) requests the points of the evaluation; these then get evaluated (and, in this case, averaged) and sent back to MPI rank 0.

This change is orthogonal to the still-missing GeometryInfo usage mentioned in #104, but it shows some techniques that might be useful there as well.

Right now, I only replaced the two uses in the LinePlotCalculator and the PressureDifferenceCalculator. I believe we should also use this infrastructure for the line plot statistics, but that is more work; if we agree on that, we can tackle it in the future. More generally, I believe we should also look into replacing all manual evaluations of shape functions, i.e., all users of the file https://github.com/exadg/exadg/blob/master/include/exadg/vector_tools/interpolate_solution.h, by some variant of the remote point evaluation. Of course, the main question is whether we can still keep reductions that are done locally on each MPI process, in order to avoid sending enormous amounts of data in case we take averages, e.g., over 2 out of 3 directions.

There is one caveat right now: in case we have a moving mesh, the point positions in reference space held by evaluation_cache will become invalid, which is not checked here. But in that case the question is whether we still want the evaluation at the same points. Maybe we can add a check? What do you think @nfehn?

@kronbichler kronbichler added code duplication Pull request contains duplicated code performance Changes that aim at improving the performance labels Dec 21, 2021
@kronbichler
Member Author

As a remark, I tested this on the backward facing step and had to fix some minor issues there. Is there any additional test I should run?

@nfehn
Member

nfehn commented Dec 21, 2021

I would probably test the pressure difference calculator for "flow around cylinder" and see whether the text files generated by the postprocessor are still the same. This at least tests the serial correctness of the code. Parallel correctness might be more involved (but I guess you know the deal.II code sufficiently well so that we do not have to test it explicitly here in ExaDG).

@nfehn
Member

nfehn commented Dec 21, 2021

Could you explain how this is related to #104 (and adjust the description of this PR)?

@kronbichler
Member Author

The proposed changes are applying a tool mentioned in #104 (VectorTools::point_values), but they do not address the code in question. The statistics calculators could probably use something similar, but I did not check closely yet.

@kronbichler
Member Author

Regarding the flow past cylinder case: I ran it with 1 level of refinement and degree 4, and got 6 changes in the last digit over the whole output file. The change with several MPI processes (compared to a serial run) is much bigger than this due to solver tolerances etc., changing at least 60% of the lines in the last 2-3 digits. Similarly for the case set in the code, l=0, k=2, with only around 25 changes in the last digit. This looks promising. On the other hand, the postprocessing time for (l=1, k=4) went down from 18.3 seconds to 2.06 seconds.

@peterrum
Contributor

On the other hand, the postprocessing time for (l=1, k=4) went down from 18.3 seconds to 2.06 seconds.

I like this :)

@nfehn
Member

nfehn commented Dec 22, 2021

There is one caveat right now: In case we have a moving mesh, the point positions in reference space held by evaluation_cache will become invalid, which is not checked here. But in that case the question is whether we still want the evaluation along the same points. Maybe we can add a check? What do you think @nfehn?

Regarding ALE: I think this depends on whether you are interested in results in material or spatial (Lagrangian vs. Eulerian) description. With physical applications such as FSI in mind, I would argue that it is natural for the cell and the reference coordinates of the point of interest to stay the same during the simulation, at least for points evaluated on the surface of a physical body (where ALE = Lagrangian, although even this might not hold due to tangential motion of the mesh along the surface of the body). Now we could argue that we need to update the points of evaluation in the postprocessor in case of ALE methods whenever we need to evaluate something. However, this restricts us to an Eulerian description, since we cannot know the position of the point of interest as a function of time for coupled problems where the mesh motion is not prescribed analytically but is an outcome of the simulation.

In essence, this means that we need to configure such routines with a boolean or enum (such as bool recompute_evaluation_points = false, with the default argument giving optimal performance for static meshes) that specifies what to do.
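The flag-based interface suggested above could look roughly like the following self-contained sketch. All names here (PointEvaluationCache, evaluate_at_points) are illustrative stand-ins, not ExaDG or deal.II API; in the real code the cache would be Utilities::MPI::RemotePointEvaluation and reinit() would perform the (expensive) cell search.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for a cache of reference-space point data. In the
// real code, reinit() would locate cells and reference coordinates via
// Utilities::MPI::RemotePointEvaluation::reinit().
struct PointEvaluationCache
{
  bool                valid = false;
  std::vector<double> reference_points;

  void
  reinit(std::vector<double> const & physical_points)
  {
    reference_points = physical_points;
    valid            = true;
  }
};

// Default argument keeps the fast path for static meshes; ALE/moving-mesh
// users pass true to rebuild the search data before evaluating.
std::size_t
evaluate_at_points(PointEvaluationCache &      cache,
                   std::vector<double> const & physical_points,
                   bool                        recompute_evaluation_points = false)
{
  if(recompute_evaluation_points || !cache.valid)
    cache.reinit(physical_points);
  return cache.reference_points.size(); // stand-in for the actual evaluation
}
```

On a static mesh the cache is built once and reused on every call; with a moving mesh, passing recompute_evaluation_points = true forces the search step to run again before evaluation.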

line != data.line_data.lines.end();
++line)
{
unsigned int n_points = (*line)->n_points;
Member

here we could add const (unsigned int const)

mapping_in);
}
else
evaluation_cache->reinit({}, dof_handler_pressure->get_triangulation(), mapping_in);
Member

I don't understand this. Why is dof_handler_velocity relevant in the if-branch and dof_handler_pressure in the else-branch? If there is no particular reason, we should extract the triangulation above, or even add it to the interface of the class.

More generally, could you explain the logic behind the if and else branches (why is an else branch necessary, and what is done there? why is reinit() called with {} in the else branch)?

Member Author

The logic is that MPI rank 0 requests the points to be evaluated (so it adds the points), whereas the other ranks do not request any points. Taking into account the other comment, it will be good not to call reinit() in the if/else branches but to collect the points to be evaluated first.

data.line_data.lines.begin();
line != data.line_data.lines.end();
++line)
LinearAlgebra::distributed::Vector<double> ghosted_pressure(pressure.get_partitioner());
Member

The class LinePlotCalculator has the template parameter Number, so why do you use double here?

Member Author

This works around an old bug in deal.II, which prevented evaluation with float numbers. When I update the code, I will put this into an ifdef for the 9.3 version of deal.II.

Comment on lines +93 to +104
ghosted_pressure.copy_locally_owned_data_from(pressure);
ghosted_pressure.update_ghost_values();

std::vector<double> pressure_values =
VectorTools::point_values<1>(*evaluation_cache, *dof_handler_pressure, ghosted_pressure);

LinearAlgebra::distributed::Vector<double> ghosted_velocity(velocity.get_partitioner());
ghosted_velocity.copy_locally_owned_data_from(velocity);
ghosted_velocity.update_ghost_values();

std::vector<Tensor<1, dim>> velocity_values =
VectorTools::point_values<dim>(*evaluation_cache, *dof_handler_velocity, ghosted_velocity);
Member

This somehow seems to break the modularity of the code that we had before. Now we always evaluate velocity and pressure, while - originally - fields were only evaluated when needed?

Member

The background is also the following: for typical flow examples or benchmarks, the velocity is evaluated along 10 lines (transversal to the main flow direction through the domain, to obtain velocity profiles), while the pressure is evaluated along a single line (the surface of a body in 2d, or a line along the surface in 3d).

if((*quantity)->type == QuantityType::Velocity)
{
std::vector<Tensor<1, dim, Number>> solution_vector(n_points);
// store all points along current line in a vector
Member

This comment seems to be wrong now, and I currently also do not see where you actually use the variable n_points.

} // loop over lines
} // write_output == true

point_offset += n_points;
Member

If this is the only occurrence of n_points, please write (*line)->n_points here.

@nfehn
Member

nfehn commented Dec 22, 2021

To justify the changes of the code in LinePlotCalculator (reducing the modularity of the code) by performance improvements, I think one needs to argue (i) which performance improvement comes from the faster evaluation routines of deal.II used in this PR, and (ii) which performance advantage comes from plugging together all the lines and evaluating everything (velocity, pressure, or even more quantities in the future, open-closed principle!) for every line (or for every point). For example, it would be more natural from an implementation point of view (that focuses on modularity) to have one RemotePointEvaluation for every line and then evaluate those quantities that are needed for the respective line.

evaluation_cache = std::make_shared<Utilities::MPI::RemotePointEvaluation<dim>>(
1e-8 * dof_handler_pressure->begin()->diameter());

// only request result on rank 0
Member

what is "result" in this context?

Comment on lines +56 to +63
if(Utilities::MPI::this_mpi_process(mpi_comm) == 0)
{
std::vector<Point<dim>> points{data.point_1, data.point_2};
evaluation_cache->reinit(points, dof_handler_pressure->get_triangulation(), mapping_in);
}
else
evaluation_cache->reinit({}, dof_handler_pressure->get_triangulation(), mapping_in);
}
Member

Why not fill the vector points in the if-else and then call the function reinit() only once afterwards? The same comment applies to other parts of the code with the same structure.
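The restructuring suggested here can be sketched with plain C++ stand-ins. The MockEvaluationCache below is purely illustrative; in the real code reinit() is the collective Utilities::MPI::RemotePointEvaluation::reinit(), which must be called on all ranks, so the point of the refactor is that every rank reaches the single call site, with an empty point list on ranks > 0.

```cpp
#include <cassert>
#include <vector>

// Illustrative mock of the evaluation cache: only the control flow matters,
// namely that reinit() is called exactly once per rank.
struct MockEvaluationCache
{
  std::vector<double> points;
  unsigned int        n_reinit_calls = 0;

  void
  reinit(std::vector<double> const & p)
  {
    points = p;
    ++n_reinit_calls;
  }
};

void
setup(MockEvaluationCache & cache, unsigned int rank)
{
  std::vector<double> points;
  if(rank == 0)
    points = {0.45, 0.55}; // stand-ins for data.point_1, data.point_2
  cache.reinit(points);    // single call site instead of one per branch
}
```

Only rank 0 fills the point list, yet every rank participates in the (collective) reinit() call, which is what the original if/else also achieved, just with duplicated call sites.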

pressure_2, *dof_handler_pressure, *mapping, pressure, point_2, mpi_comm);

Number const pressure_difference = pressure_1 - pressure_2;
// default flag combination computes average from all cells
Member

This comment is difficult to understand since I do not see default flags.

@kronbichler
Member Author

To justify the changes of the code in LinePlotCalculator (reducing the modularity of the code) by performance improvements, I think one needs to argue (i) which performance improvement comes from the faster evaluation routines of deal.II used in this PR, and (ii) which performance advantage comes from plugging together all the lines and evaluating everything (velocity, pressure, or even more quantities in the future, open-closed principle!) for every line (or for every point). For example, it would be more natural from an implementation point of view (that focuses on modularity) to have one RemotePointEvaluation for every line and then evaluate those quantities that are needed for the respective line.

As you suspected, there are performance (and storage) reasons why I chose this implementation:

  • The lookup in RemotePointEvaluation uses some bounding boxes and other possibly expensive operations on the mesh, so gathering all points into a single object helps keep this auxiliary data small and actually reap the benefits of re-using what has been computed internally before. I did not measure this effect, but I would argue that keeping a data structure for each point goes completely against those concepts.
  • If the number of lines is not too large, keeping a separate cache object per line should not be too expensive. But the problem is that as one extends use cases towards a larger number of points, things could get pretty bad performance-wise and memory-wise. I could not find a direct use of this line plot calculator class (or is there one? I would really like to test it), so it is hard to foresee whether the performance might be affected in the future.
  • Once the cells and reference coordinates within cells have been identified, the actual evaluation of solution fields is a very small portion of the cost in the unstructured case implied by this interface (as opposed to the statistics evaluators, where we have more structure), so even unnecessary evaluations will be lower-order contributions performance-wise. Of course, one could wish for a way to tell VectorTools::point_values that only a subset of the points known to it should get evaluated, or, alternatively, to register different evaluation contexts in a single RemotePointEvaluation object, but that would need additional work on the data structures and the user interfaces. If you're interested in implementing this functionality, I'd be happy to help you understand the internals of the present function.
  • Part of my design is a fundamental software design pattern one adopts when programming with MPI, namely to aggregate as much as possible into a single "communication step", because the latency of the communication steps is usually the expensive part. (There are of course exceptions where latency is not the relevant factor, but here it is. One could argue that this function is not performance-critical enough for it to matter much, but just as future extensions can make modularity an issue, they can also turn performance into the bottleneck.)

As said above, I could not quantify it on this particular function as I did not find a use example, but from overall experience with this function and the measurements with the point values for the pressure difference I would estimate that most of the benefit (in the order of a factor 100) comes from using the RemotePointEvaluation object with its built-in cache and reasonably low number of such objects. On top of that, having 1 or 20 such objects to make a modular design for each line and quantity to be evaluated should have a performance difference on the order of a factor of 2 with reasonably good MPI implementations, at the price of an unspecified increase in memory use. Also, keeping separate objects would lead to somewhat more code to keep the vectors of evaluation objects and fill them correctly, albeit with a simplified and modular access afterwards (as one cannot mix up the points from different lines).

To summarize, I see your point in terms of modularity, but I would take the liberty to suggest a performance-oriented design here for now (with extended comments of course, as suggested above). We can always adjust it later if we see a pressing need in another direction. Given that the current class is not really used much, I see little reason in spending much time refactoring it into a different shape, as long as we can keep the benefit of reduced code size, which was my original intent with this PR.
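The aggregation argument above can be made concrete with the classic latency/bandwidth (alpha-beta) cost model. All numbers below are illustrative assumptions for the sake of the sketch, not measurements from this PR:

```cpp
#include <cassert>

// Alpha-beta model: each message pays a fixed latency (alpha) plus a
// per-byte cost (beta). Aggregating k exchanges into one saves
// (k - 1) * alpha while moving the same total payload.
double
communication_cost(unsigned int n_messages,
                   double       latency_per_message,
                   double       cost_per_byte,
                   double       total_bytes)
{
  return n_messages * latency_per_message + cost_per_byte * total_bytes;
}
```

With an assumed latency of 1 microsecond per message and roughly 10 GB/s of bandwidth, twenty separate exchanges carrying 8 kB in total are dominated by the 20 latency terms, whereas one aggregated exchange of the same payload pays the latency only once; this is the mechanism behind the estimated factor-of-2 difference between 1 and 20 evaluation objects.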

@nfehn
Member

nfehn commented Jan 11, 2022

I am of a different opinion here.

The example that uses this code should be "cavity", see https://www.sciencedirect.com/science/article/pii/S0021999117306915 (you might have to set the parameter calculate = true in the application.h file). The argument that the code is not used a lot is not so strong in my opinion: L2-error calculation is used much more often in the postprocessing routines of ExaDG, but does this imply that L2-error calculation (requiring analytical solutions) is more important for a CFD solver in general? As you wrote yourself in the beginning, we might in the future want to extend the code proposed here towards the line plot calculations with statistics (which are used "more often", namely for turbulent flow examples).

Another main argument was that the performance is improved. However, with the argument that the code is currently not used a lot, one could also argue that performance is not super important here, in particular since this is a postprocessing routine for which we currently have no clear indication that it forms a bottleneck. So from my point of view, the main benefit of this PR is that it removes implementations that have in the meantime become available in deal.II in a more generic way, reducing the code maintenance effort for ExaDG.

So after the performance numbers that you gave (a factor of 100 and a factor of approx. 2), I would have been pretty sure that we could find the compromise of deciding for improved flexibility here, and change it towards a more performant implementation with an MPI-driven design once we encounter bottlenecks in the future. I consider this a "compromise" because the factor of 100 gives you a lot of the performance improvement, while at the same time such a solution would respect the modular design that we had before. Regarding performance, I would bring the argument that under mesh refinement, neither the number of lines nor the number of points along a line will increase. The latter holds because for such line plot results the discretization error is relevant rather than the interpolation error; in other words, the number of points along a line and the spatial resolution of the simulation are such that we will see the discretization error for practical CFD examples. Of course, one could adapt the resolution of the line plot to the spatial resolution of the simulation, but I think in practice one will fix this parameter with some conservative estimate in order to minimize the number of parameters for a practical flow problem. From this perspective, I would argue that the code under debate does not become critical for large-scale applications, in particular with the new deal.II-based implementation, but maybe I am wrong.

@nfehn
Member

nfehn commented Jan 11, 2022

@kronbichler Would you like to open two issues (to not forget those aspects since the discussions in this PR are already quite lengthy)?

I believe we should also use this infrastructure for the line plot statistics, but that is more work. If we agree on that, we tackle that in the future.

More generally, I believe we should also look into replacing all manual evaluations of shape functions, i.e., all users of the file https://github.com/exadg/exadg/blob/master/include/exadg/vector_tools/interpolate_solution.h , by some variant of the remote point evaluation.
