How to detect pages in PDF files that are actually two-pages?
I have a PDF (it does not contain scanned images) in which each page is actually a two-page spread, i.e. two book pages side by side on one PDF page.
However, some pages are normal single pages. When I wrote a program to convert the file to single pages, I had to scroll through the whole file, identify the exceptional pages by eye, and write them into a list so that the program knew which pages not to cut in half (I used mutool for the cutting; it works for this type of file).
So how can I detect which pages are normal and which are not? Please help me, thank you very much.
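The cutting itself is not the problem: mutool's poster subcommand can split every page of a file into a grid, so something along these lines halves each page vertically (shown for illustration; check your mutool version's usage):

    mutool poster -x 2 input.pdf output.pdf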
asked Mar 29 at 12:28 by Dang Manh Truong
You can try to play with pdfinfo from the poppler-utils package (or other utils from it).
– N0rbert, Mar 29 at 12:49
@N0rbert thank you very much. I found that pdfinfo did not help much, but pdftohtml helped a lot, because I can use the HTML converted from the PDF as input to the regexes, and this lets me detect the single pages. It is admittedly not optimal, but it is acceptable for me.
– Dang Manh Truong, Mar 30 at 9:29
1 Answer
After playing with several utilities from the poppler-utils package, I finally arrived at an acceptable, but not optimal, solution.
It turns out that detecting double pages in PDF files is rather tricky. I was unable to find any library that can do it easily. So in the end I decided to use pdftohtml, a tool from the poppler-utils package, to convert each page into HTML, and then use regular expressions to pick out the pages that are not double pages. Interestingly, I was able to classify most pages correctly using just one or two lines of the HTML file. It does not work in every case, as some double pages get marked as single pages, but it seems that no single page gets marked as a double page, so there is no risk of damaging the original file.
Here is what I did. I mostly relied on detecting the header (page) number, which in almost all cases is on the first line of the HTML file (after the initial lines, which are identical across all pages).
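I run pdftohtml on one page at a time, e.g. for page 12; with this input name the extracted text ends up in Lenin06s.html (pdftohtml also writes Lenin06.html and Lenin06_ind.html, which the script below deletes afterwards):

    pdftohtml -i Lenin06.pdf -f 12 -l 12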
In the introduction of the book, the header number uses Roman numerals, so I matched it with the corresponding regexes:
if (re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line)):
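For instance, a line like the following (a made-up example of pdftohtml output) is caught by the first pattern:

    <a name=3></a>VII<br/>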
Another thing I noticed: if the line (actually the 31st line, since the first 30 lines are identical across all pages) contains an image link, then the page most likely does not need to be cut in half. (There are cases where the left half is blank and the right half contains an image, but these are few and far between, so I just iterate through the resulting pages and remove the ones that are double pages.) For this I simply search for the string "img".
I also found that double pages have the header number right at the beginning, so I simply used:
if (re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>V. I. L ª - n i n &#[0-9]*;<br/>', line)):
(The last pattern is there because of a few special pages that need special treatment.)
In the end it does not detect all single pages, but the good thing is that it never wrongly classifies a single page as a double page. So if the result is, say, [1, 5, 100], I can simply go through the list and check each page visually. This is still not completely automated, but it is much, much better than having to check every single page.
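Once the detection script below has produced the exception list, the cutting itself can go roughly like this. This is a sketch only: pdfseparate comes from poppler-utils, mutool from mupdf-tools, and the file names and the example page list are illustrative:

    #!/bin/sh
    EXCEPTIONS=" 1 5 100 "               # pages that must NOT be cut, space-padded
    pdfseparate Lenin06.pdf page-%d.pdf  # one PDF file per page
    for f in page-*.pdf; do
        n=${f#page-}; n=${n%.pdf}
        case "$EXCEPTIONS" in
            *" $n "*) ;;                                            # single page: keep as-is
            *) mutool poster -x 2 "$f" tmp.pdf && mv tmp.pdf "$f" ;; # double page: cut in half
        esac
    done
    mutool merge -o Lenin06-single.pdf $(ls -v page-*.pdf)          # reassemble in order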
For those interested, here is my code (Python 2.7):
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Find pages that are not double pages
# OS: Ubuntu
# Requirements: Python 2.7, pdftohtml (from poppler-utils)
import errno
import os
import pdb
import re

def silentremove(filename):
    try:
        os.remove(filename)
    except OSError as e:  # this would be "except OSError, e:" before Python 2.6
        if e.errno != errno.ENOENT:  # errno.ENOENT = no such file or directory
            raise  # re-raise the exception if a different error occurred

def cleanup(basename):
    # pdftohtml leaves three files behind for every page; remove them all
    silentremove(basename + ".html")
    silentremove(basename + "_ind.html")
    silentremove(basename + "s.html")

num_of_pages = 395
input = "Lenin06.pdf"
basename = input[:-4]
excps = []  # pages that should NOT be cut in half
i = 1
while i <= num_of_pages:
    if i in (1, 2):  # the first two pages are known single pages
        excps.append(str(i))
        i += 1
        continue
    if i in (3, 4):  # pages 3 and 4 are known double pages
        i += 1
        continue
    # Convert just page i to HTML; the text lands in <basename>s.html
    os.system("pdftohtml -i %s -f %d -l %d" % (input, i, i))
    html_file = basename + "s.html"
    with open(html_file, 'rt') as html_fid:
        for j in range(30):  # the first 30 lines are identical on every page
            html_fid.readline()
        line = html_fid.readline().strip()
        if re.findall("img", line):
            # Pages containing an image link are almost always single pages
            excps.append(str(i))
        elif re.findall('<a name=[0-9]*></a>&#[0-9]*;<br/>', line):
            excps.append(str(i))
        elif (re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line)):
            pass  # Loi tua (Introduction): Roman page number, so a double page
        elif (re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>V. I. L ª - n i n &#[0-9]*;<br/>', line)):
            pass  # Trang doi (Double page)
        elif (re.findall('<a name=[0-9]*></a>[^0-9&#;]* <br/>', line) and
              re.findall('^[0-9]*&#[0-9]*;<br/>$', html_fid.readline().strip())):
            # 1 so truong hop trang trai trong, trang phai co chu
            # (Some cases where the left page is blank while the right
            # page contains text)
            pass
        else:
            excps.append(str(i))
    cleanup(basename)
    i += 1

# pdftohtml also dumps extracted images into the working directory
for name in os.listdir("./"):
    if name.endswith(".png") or name.endswith(".jpg") or name.endswith(".jpeg"):
        silentremove(name)

pdb.set_trace()  # inspect excps here
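Run it from the directory containing the PDF; when it stops at pdb.set_trace(), the list excps holds (as strings) the numbers of the pages that should not be cut in half.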
And this is the file: https://drive.google.com/open?id=1vjnebt3xEuY8odhZHPwL8pf26l8ySdnE (this is just an example; I have many more that need to be converted to single pages).
answered Mar 30 at 9:27 by Dang Manh Truong