@lfoppiano
Collaborator

This PR implements the italic, bold, superscript, and subscript styles in the output XML.
See #160 for details.

@coveralls

coveralls commented Jul 25, 2022

Coverage Status

coverage: 40.696% (+0.8%) from 39.903%
when pulling 188cda5 on feature/add-styles-xml
into be9e652 on master.

@lfoppiano lfoppiano marked this pull request as ready for review September 13, 2022 06:57
@kermitt2 kermitt2 added this to the 0.7.3 milestone Nov 7, 2022
@kermitt2 kermitt2 modified the milestones: 0.7.3, 0.8.0 Apr 23, 2023
@kermitt2
Owner

Hi @lfoppiano !

I think this branch will require quite a few tests (I suspect it will cause problems for some of the GROBID modules, and I need to check its consistency with Pub2TEI), so I pushed its release back to version 0.8.0.

One point related to "document structure" versus "narrative style" is the bold style for section titles. I think it is the same case as the italic/bold for the reference markers: the logical "section title" structure is already captured by the <head> element, so I would ignore the style for all section titles.

For example in the attached pdf, the style should be ignored here:

           <div xmlns="http://www.tei-c.org/ns/1.0">
                <head n="1"><hi rend="bold">Introduction</hi></head>

In contrast, the style here should be kept because it corresponds to a highlight within the flow of the paragraph text:

                <p>12. <hi rend="bold">Average tf-idf similarity between citance and title of the cited paper (F12):</hi> We calculate the similarity of each citance with the title of the cited paper and take an average of it.</p>
                <p>13. <hi rend="bold">Maximum tf-idf similarity between citance and title of the cited paper (F13):</hi> We take the maximum of similarity of the citances with the title of the cited paper.</p>

Does it make sense?

qss_a_00170.pdf

@lfoppiano
Collaborator Author

@kermitt2 yes, no problem with pushing it back.

I'm OK with the change you propose.

@lfoppiano lfoppiano self-assigned this Apr 27, 2023
@lfoppiano
Collaborator Author

The crazy part was merging master back into this branch 😅

> For example in the attached pdf, the style should be ignored here:

I've made the change and now the text within the <head> will not have the style applied:

            <div xmlns="http://www.tei-c.org/ns/1.0">
                <head n="1">Introduction</head>
                <p>Literature searches are crucial to discover
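
As a rough illustration of the rule discussed here (this is not the actual TEIFormatter code), the following Java sketch drops the style markup when rendering a <head> and keeps it for any other element such as <p>. The `HeadStyleSketch` class, the `Run` type, and the `render` and `escape` helpers are hypothetical names invented for this example, and the paragraph text is shortened.

```java
import java.util.Arrays;
import java.util.List;

class HeadStyleSketch {

    // Hypothetical styled text run (not a GROBID class); rend == null means unstyled.
    static final class Run {
        final String text;
        final String rend;

        Run(String text, String rend) {
            this.text = text;
            this.rend = rend;
        }
    }

    // Minimal XML escaping for text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Hypothetical helper: wraps styled runs in <hi rend="..."> unless the target
    // element is <head>, where the element itself already marks a section title.
    static String render(String element, List<Run> runs) {
        boolean keepStyles = !"head".equals(element);
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(element).append('>');
        for (Run run : runs) {
            String text = escape(run.text);
            if (keepStyles && run.rend != null) {
                sb.append("<hi rend=\"").append(run.rend).append("\">")
                  .append(text).append("</hi>");
            } else {
                sb.append(text);
            }
        }
        sb.append("</").append(element).append('>');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Section title: the bold style is ignored.
        System.out.println(render("head", Arrays.asList(new Run("Introduction", "bold"))));
        // Paragraph: the bold highlight inside the text flow is kept.
        System.out.println(render("p", Arrays.asList(
                new Run("12. ", null),
                new Run("Average tf-idf similarity (F12):", "bold"),
                new Run(" We calculate the similarity.", null))));
    }
}
```

Deciding on the element name keeps the sketch short; a real formatter would more likely make this decision wherever the section title is assembled.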

> In contrast, the style here should be kept because it corresponds to a highlight within the flow of the paragraph text:

                <p>12. <hi rend="bold">Average tf-idf similarity between citance and title of the cited paper (F12):</hi> We calculate the similarity of each citance with the title of the cited paper and take an average of it.</p>
                <p>13. <hi rend="bold">Maximum tf-idf similarity between citance and title of the cited paper (F13):</hi> We take the maximum of similarity of the citances with the title of the cited paper.</p>

I'm not sure what you mean in this case 🙂

Luca Foppiano added 3 commits December 17, 2023 19:41
# Conflicts:
#	grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java
#	grobid-core/src/test/java/org/grobid/core/document/TEIFormatterTest.java
@lfoppiano lfoppiano removed this from the 0.8.0 milestone Jun 9, 2024
Development

Successfully merging this pull request may close these issues.

Sub/superscript are displayed as plain text characters in the TEI output

4 participants