Skip to content

Conversation

@ryandens
Copy link
Member

@ryandens ryandens commented Jun 10, 2021

✨ Capture request bodies on Undertow

Background

  • Undertow is a web server written in Java
  • It's leveraged by/integrated with several large projects including Wildfly and RESTEasy.
  • Projects that build on top of Undertow often leverage its Java EE Servlet API Implementation
  • Its implementation of the Servlet API does not always leverage the Servlet API javax.servlet.http.HttpServletRequest.getInputStream()

Overview

  • ✨ Adds Instrumentation to associate org.xnio.channels.StreamSourceChannel instances with a SpanAndBuffer instance to give the proper context
  • ✨ Adds instrumentation to copy data from a ByteBuffer after it is read from an org.xnio.channels.StreamSourceChannel into a ByteBuffer and associate that data with the request body Span attribute.

Testing

  • ✅ Adds a test case that reproduces a situation where our existing implementation previously did not capture request body on an Undertow deployment that we would otherwise expect to work on
  • Manual tests on a Wildfly servlet container with a RESTEasy JAX-RS deployment

Checklist:

  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules

Documentation

No new documentation is required, as we broadly claim Java Servlet support of which this undertow use-case is an implementation. This just happens to be a corner case.

Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good progress 💯

A couple of things to look at:

  • avoid using final for local variables in short methods.
  • the body capture should for content type and capture only specific types - see the spec in the HT org. Also check for content length to preallocate size of the buffer and charset.
  • the instrumentation code should use HT config to check if capture is enabled.

HypertraceCallDepthThreadLocalMap.reset(StreamSourceChannel.class);
final ContextStore<StreamSourceChannel, SpanAndObjectPair> contextStore =
InstrumentationContext.get(StreamSourceChannel.class, SpanAndObjectPair.class);
final SpanAndObjectPair spanAndObjectPair = contextStore.get(streamSourceChannel);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove final it does not help here and it just adds noise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree here - I find that defaulting local variables to final discourages mutating references to objects and results in code that is easier to read and maintain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final is useful in large methods where there is a higher probability of making a mistake by accidentally reusing the variable. The final should be used to improve the readability of large code blocks or as a "documentation" of the design.
In this PR the final keyword is used for all local variables in very short methods (e.g. 5-10 LOC), which does not make sense to me.

The objective is to write as little code as possible because that is the easiest to read, maintain.... Look at Java std lib/JDK implementation or read the effective java book.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a big fan of effective java! I don't recall an item about final and/or non-final local variables, but I personally find that flipping the switch in the IDE and defaulting to final local variables makes the code easier to maintain and build upon over time. The cost of making variables final is low (because it can be done automatically) and IMO it helps make the code a little more readable because its clear at variable declaration time when a variable is reassigned, rather than relying on the reader to figure out whether a variable is effectively final (because it's never reassigned) or if it's actually mutated. This scales well over time as a function organically grows in length over time and is modified by multiple authors, who may or may not have context on what the previous person wrote. I guess I just disagree that the least amount of code possible is the easiest to read and maintain. A good counterexample to that point, in my opinion, is how books like effective java encourage the use of inheritance over composition. In general, using composition rather than inheritance results in more code (as measured by LOC) but most folks agree that composition is a better default method of sharing code because it is easier to understand what is happening when read in the future. That's not to say that there isn't a place for inheritance (or non-final local variables), just that it's part of what influences what I tend to do by default.

That being said, in the context of this PR it isn't terribly important, so if you feel strongly about this I'd be happy to table the discussion for another time and remove the final keyword from local variables in this PR. We can address it at a time when we can also add in automation to enforce whatever decision we come to.

@ryandens ryandens force-pushed the undertow-http-bodies branch from 56b92a9 to 71ec0bd Compare June 14, 2021 21:37
@ryandens ryandens changed the title 🐛 Capture HTTP bodies on Undertow when not accessed via the Servlet API (Draft, work in progress) ✨ Capture HTTP bodies on Undertow Jun 15, 2021
@ryandens ryandens marked this pull request as ready for review June 15, 2021 05:09
Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undertow support should be added to the readme.

Should this instrumentation capture headers as well? This is the first servlet container instrumentation in this project. Could the payloads be captured twice? E.g. first in undertow and then in the servlet instrumentation?

final Charset charset = Charset.forName(httpServerExchange.getRequestCharset());
final BoundedByteArrayOutputStream boundedByteArrayOutputStream =
BoundedBuffersFactory.createStream(
(int) httpServerExchange.getRequestContentLength(), charset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what size is used if the content length is 0 e.g. for chunked transfer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BoundedBuffersFactory has some logic to help us out with that. In the case of stream being created in this way, we'll simply use the default ByteArrayOutputStream constructor which uses the default initial size of 32. Regardless, the ByteArrayOutputStream will grow to fit the size of its content as we write to it (until restricted by the BoundedByteArrayOutputStream). I'm sure we can improve on this, but I'm trying to keep this PR scoped to the bug with the customer as I don't have great data on how/when we can improve this right now

@ryandens ryandens force-pushed the undertow-http-bodies branch 2 times, most recently from 5bd5a0e to f7ae59c Compare June 22, 2021 17:46
@ryandens
Copy link
Member Author

Undertow support should be added to the readme.

I'm avoiding adding undertow support to the readme because we specifically only will only fully work on Undertow servlets. Due to the implementation details of APIs like io.undertow.servlet.spec.HttpServletRequestImpl.getParameterMap(), we know that the request body won't always be accessed via the standard servlet body access APIs getReader and getInputStream() so we add this as a way to support servlets running on Undertow, but it does not mean that we support running on a raw Undertow HTTP server ( we don't).

Should this instrumentation capture headers as well?

Theoretically, we could add that over time, but the goal of this PR is to solve the request bodies missing when running undertow with Servlets, one of our supported use cases. Right now, adding full-out undertow support isn't our top priority

This is the first servlet container instrumentation in this project. Could the payloads be captured twice? E.g. first in Undertow and then in the servlet instrumentation?

I'm really glad you asked this! This was indeed happening when you brought this up. The solution I came up with was to add some lightweight instrumentation to undertow-servlet to mark and store in the instrumentation context whether or not an HttpServerExchange should be instrumented with undertow specific body-capturing.

* #APP_SERVER} instrumentation can short-circuit when it is know that that the preferred {@link
* #SERVLET} instrumentation is sufficient to capture the request body
*/
public enum RequestBodyCaptureMethod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We literally do not need this type in the javaagent-core. javaagent-core should contain only types that are used by multiple instrumentations. Which does not seem to be the case here. This type can be moved to undertow module. Also only SERVLET value is used in the codebase.

I think I know your intentions with this type and overall approach, however it won't work as a generic solution across instrumentations. We should design an API that will tell you if payload or headers for a specific request were captured and it should work for both client and server instrumentations (clients have the same problem e.g. jax-rs client implementation is often times backed by apache HTTP client...). The approach in this PR won't work bc it uses instrumentation context with specific types 3rd party library types e.g. io.undertow.server.HttpServerExchange.

You might want to look at https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/3e28b01e42311d755db17d698966d8f3bfaea1e3/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/Instrumenter.java#L88 to see how to design this using the APIs that are available for all instrumentations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agreed, I'll move this type into a common library project shared by both undertow instrumentation projects. Over time, when we have more use-cases like this one, we can design a more general solution. This was definitely premature 🙂

final BoundedByteArrayOutputStream boundedByteArrayOutputStream = spanAndBuffer.byteArrayBuffer;
for (int i = 0; i < numBytesRead; i++) {
final byte b = readOnlyBuffer.get();
boundedByteArrayOutputStream.write(b);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem efficient, ByteBuffer should have an API to return the allocated array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ByteBuffer.array() does exist, but it's not guaranteed to work (depending on the kind of ByteBuffer being used. As far as I know, it does not work on DirectByteBuffers, only on heap-based buffers. In the case of undertow, the ByteBuffer subclass being used in the test case is a DirectByteBuffer so ByteBuffer.array() will throw an UnsupportedOperationException.

My first pass at this, after discovering that ByteBuffer.array() does not suit our needs, was to do something like:

byte[] dst = new byte[numBytesRead];
readOnlyBuffer.get(dst);
boundedByteArrayOutputStream.write(dst);

However, after doing this, I looked at the implementation of ByteBuffer.get(byte[], int, int). It looks like this:

public ByteBuffer get(byte[] dst, int offset, int length) {
  checkBounds(offset, length, dst.length);
  if (length > remaining())
    throw new BufferUnderflowException();
  int end = offset + length;
  for (int i = offset; i < end; i++)
      dst[i] = get();
      return this;
}

Note ByteBuffer.get() is called n times, as it is in this method as well. So, by calling ByteBuffer.get() directly here, it is invoked the same number of times as ByteBuffer.get(byte[]), except that we get to avoid allocating a duplicate byte[] just to have it eligible for GC at the end of this utility method. This could put additional GC pressure on the application and result in degraded performance. I'm definitely open to other options of copying data from the ByteBuffer! But as far as I know, this is our best option today.

@ryandens ryandens requested a review from pavolloffay June 23, 2021 14:57
@ryandens
Copy link
Member Author

Hey @pavolloffay, thanks for taking the time to review these changes! Do you have any further comments/concerns about this PR?

Copy link
Contributor

@davexroth davexroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this PR as is, as it's time sensitive.

Any additional changes regarding code style etc can be picked up in a later PR.

Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there are 3 gradle modules? Any real justification?

There is:

  • common
  • undertow
  • undertow-servlet

Does the undertow instrumentation work without servlet? I guess the servlet depends on the core so if the core is not reused I would merge them together.

* Signals that request body should be captured relying on the instrumentation of APIs defined
* inside the app server's implementation that do not relate to {@code javax.servlet} APIs
*/
APP_SERVER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is not being used anywhere


dependencies {
implementation("io.opentelemetry.javaagent.instrumentation:opentelemetry-javaagent-undertow-1.4:${versions["opentelemetry_java_agent"]}")
library("io.undertow:undertow-core:2.0.0.Final")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be 1.4.0. It is what muzzle check is configred against. The idea is to use the lowest version possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me 👍 FWIW i copied this part from the OTEL undertow module

@ryandens
Copy link
Member Author

Why there are 3 gradle modules? Any real justification?

There is:

* common

* undertow

* undertow-servlet

Does the undertow instrumentation work without servlet? I guess the servlet depends on the core so if the core is not reused I would merge them together.

Yes, the common module is depend on by both the undertow module and the undertow-servlet module. The undertow module works without the servlet module because we can't count on undertow-servlet always being used. The undertow-servlet insrumentation module serves to add instrumentation to undertow-servlet APIs in a way that can be detected by our undertow-core instrumentation. This allows us to intelligently avoid capturing the reqeust body twice depeneidng on on what APIs are used to access it/in what context.

ryandens added 12 commits June 25, 2021 14:21
✨ add noop instrumentation module for Undertow

🐛 dont assert inverse for now

Revert "♻️ move more code from the servlet into utils to make it debuggable"

This reverts commit 765c37a.

:white_check_mark: add failing test for bug

:heavy_plus_sign: add undertwo as implementaiton dependency

:construction: put SpanAndObjectPair in Instrumentationcontext where the context key is the channel returned by getRequestChannel and the object associated with the channel is the HttpServerExchange

:sparkles: instrument StreamSourceChannel#read to add the request body to the span when the channel is read

:heavy_minus_sign: move servlet API from library configuration to just testImplementation as it is not need on the main compile classpath

:construction: WIP avoid String.trim() of request bodies by only writing the number of bytes that were read by StreamSourceChannel.read()

:recycle: change type of SpanAndBuffer to be a BoundedByteArrayOutputStream

:recycle: refactor UndertowHttpServerExchangeInstrumentation to store SpanAndBuffer instead of SpanAndObjectPair

:recycle: refactor StreamSourceChannelInstrumentation to use SpanAndBuffer object and write to bounded outputstream without allocating an extra byte[]

:fire: remove unused utils

:bug: update InputStreamInstrumentationModuleTest to use Bounded OuptutStream

:bug: fix code for muzzle

:ok_hand: move logger to top level, remove unneeded log statements

:ok_hand: update class names to start with uppercase

:rocket: capture JAR as artifact

:ok_hand: only capture request body if configured to do so

:ok_hand: properly handle Throwable for callDepth

:white_check_mark: assertInverse

:art: effectively final local variable

:art: formatting of comment

:bulb: add some javadoc

:recycle: specify Advice classes by reference rather than with Strings

:bug: handle -1 content length
@ryandens ryandens force-pushed the undertow-http-bodies branch from 739e435 to 61d9630 Compare June 25, 2021 18:22
@ryandens ryandens merged commit a2333cf into main Jun 25, 2021
@ryandens ryandens deleted the undertow-http-bodies branch June 25, 2021 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants