Fixed extractText() - Not returning text with spaces #569
MartinThoma merged 1 commit into py-pdf:master from inboxsgk:Fix---extractText-Spacing-Problem
Previously, `.extractText()` read the text in the PDF and returned it without any spaces. This fix modifies the `pdf.py` file to add a space (" ") between words. Here is an example:

- Original sentence: "The quick brown fox jumps over the lazy dog"
- Previous output: "Thequickbrownfoxjumpsoverthelazydog"
- Output after the fix: "The quick brown fox jumps over the lazy dog"
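The symptom and the effect of the change can be illustrated with a minimal sketch. This is not the actual patch to `pdf.py`: the function `join_text_items` and the `gap_threshold` value are hypothetical, and the input only loosely models a PDF text-showing array in which strings alternate with numeric position adjustments.

```python
from typing import List, Union

# Hypothetical model: a text-showing array mixes string fragments with
# numeric position adjustments (large negative values widen the gap).
TextItem = Union[str, int, float]


def join_text_items(
    items: List[TextItem],
    add_spaces: bool = True,
    gap_threshold: float = -200.0,  # assumed cutoff, not taken from PyPDF2
) -> str:
    """Concatenate string fragments, optionally turning wide gaps into spaces."""
    parts: List[str] = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif add_spaces and item <= gap_threshold:
            # Treat a wide gap as a word boundary, avoiding doubled spaces.
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
    return "".join(parts)


items = ["Th", -15, "e", -250, "quick", -250, "brown", -250, "fox"]
print(join_text_items(items, add_spaces=False))  # Thequickbrownfox
print(join_text_items(items, add_spaces=True))   # The quick brown fox
```

With `add_spaces=False` the fragments run together as in the previous output; with `add_spaces=True` the wide gaps become single spaces while the small kerning adjustment inside "The" is ignored.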
Thank you for the contribution! I'm sorry that it took so long - I'll try to be quicker in future 🤞
Features:
- Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (#612)
- Add whitespace between words for extractText() (#569, #334)
- "invalid escape sequence" SyntaxError (#522)
- Avoid error when printing warning in pythonw (#486)
- Stream operations can be List or Dict (#665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):
- Add GitHub Action which automatically runs unit tests via pytest and static code analysis with Flake8 (#660)
- Add several unit tests (#661, #663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (#288)
- Use setuptools instead of distutils (#599)
- Improvements for the PyPI page (#644)
- Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0
Could you please tell me which directory contains the PyPDF2 source file with the `extractText()` method?
It's in `_page.py`
I just found the -page.py file in the "PyPDF2" folder (outside the `__pycache__` folder). The problem is that it only has 1,000 or so lines, whereas the one modified on Git has about 3,000. Maybe I don't have the right file or version (yet I installed the package yesterday :/)
_page.py *
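For anyone unsure which copy of the package Python is actually importing, a short sketch like the following can help. It uses only the standard library; the helper names `module_location` and `installed_version` are made up for this example, and it assumes Python 3.8 or later for `importlib.metadata`.

```python
import importlib.metadata
import importlib.util
from typing import Optional


def module_location(module_name: str) -> Optional[str]:
    """Return the file Python would load for a top-level module, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None  # module is not importable in this environment
    return spec.origin


def installed_version(distribution_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return importlib.metadata.version(distribution_name)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # For a package, the location is its __init__.py; files such as
    # _page.py would sit in the same directory.
    print(module_location("PyPDF2"))
    print(installed_version("PyPDF2"))
```

Comparing the printed version with the release that contains the change shows whether the local file is expected to match the one on GitHub.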
I just replaced my "_page.py" file with a copy of the one on Git... still not working. If you don't mind, of course, could you tell me where the problem might be?
@Viennoiserie / @inboxsgk,
I am trying to write a function (for my web app) that appends all the words contained in the PDF to an array. The app then finds the words requested by the user. So I opened Word, wrote text that would be "hard" for Python to work with, and the results aren't the ones I wanted. I expect: ['Thomas', 'Vienot', 'CACA', 'Partie'] but I get: ['Thomas', 'VienotCACA', 'Partie']
If you want, I can also provide my code (nothing too complex, but I think it should work): `from PyPDF2 import PdfFileReader` `def pdf_to_words(file_name):` `def main():` `if __name__ == "__main__":`
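The function bodies did not survive in the comment above, so the following is only a rough sketch of what such a word-collecting helper might look like, not the commenter's code. It assumes the PyPDF2 1.x API (`PdfFileReader`, `numPages`, `getPage`, `extractText`); `text_to_words` and `find_words` are hypothetical helper names.

```python
from typing import Iterable, List


def text_to_words(text: str) -> List[str]:
    """Split extracted text into words on any run of whitespace."""
    return text.split()


def find_words(words: Iterable[str], queries: Iterable[str]) -> List[str]:
    """Return the words that match any query, ignoring case."""
    wanted = {query.lower() for query in queries}
    return [word for word in words if word.lower() in wanted]


def pdf_to_words(file_name: str) -> List[str]:
    """Collect the words of every page of a PDF (assumes PyPDF2 1.x)."""
    # Imported here so the helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    words: List[str] = []
    with open(file_name, "rb") as handle:
        reader = PdfFileReader(handle)
        for index in range(reader.numPages):
            words.extend(text_to_words(reader.getPage(index).extractText()))
    return words


# Splitting cannot restore a space that is missing from the extracted text.
print(text_to_words("Thomas Vienot\nCACA Partie"))
print(text_to_words("Thomas VienotCACA Partie"))
```

The second call shows why the reported result contains 'VienotCACA': if the extracted text has no whitespace between two words, splitting afterwards cannot separate them, so the outcome depends on what `extractText()` returns for that particular PDF.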
@Viennoiserie,
Thank you. Indeed, I tried my program on other PDFs and there was no problem :/