Extracting a specific string after a given string from HTML file using a bash script
I have an HTML file, momcpy.html, from which I want to extract the string that follows a given string.
The file content looks like this:
<tr><br>
<th height="12" bgcolor="#808080"><label for="<br>
LSCRM:Abhijeet<br>
<br>
MCRM:Bhargav<br>
<br>
TLGAPI:GAURAVAURAV<br>
<br>
MOM:MANIKA"></td><br>
This is present on one of the lines of the HTML file.
I want to extract MANIKA and store it in a variable. Basically, I want to extract whatever string comes after MOM:, and that string could be dynamic.
I have tried:
file='/home/websphe/tomcat/webapps/MOM/web/momcpy.html'
y=$( awk '$1=="MOM:"{print $2}' $file)
echo "$y"
But that didn't work.
command-line bash text-processing
edited Sep 3 '17 at 21:13
asked Sep 3 '17 at 19:00
Abhijeet Anand
Here I want to extract the string after "MOM:": <tr><br> <th height="12" bgcolor="#808080"><label for="<br> LSCRM:Abhijeet<br> <br> MCRM:Bhargav<br> <br> TLGAPI:GAURAVAURAV<br> <br> MOM:MANIKA">Agenda:</label></th><br>
– Abhijeet Anand
Sep 3 '17 at 19:33
edited question
– Abhijeet Anand
Sep 3 '17 at 19:38
This document is not well-formed HTML and thus not actually HTML. Element attributes may not contain unescaped < or > and the for attribute of the label element may only contain IDs of other (form) elements. It would also help if you included the full element tree from the root down to the element in question so that one may build a solution based on an actual (X)HTML parser.
– David Foerster
Sep 3 '17 at 21:27
3 Answers
accepted
I can't sensibly advise doing this, because parsing HTML with regex is not likely to end well, but you might be able to get the string MANIKA with:
sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' file
It works OK on your sample anyway...
Notes
-n don't print anything until we ask for it
-r use ERE
/string/ find lines with string
s/old/new/ replace old with new
.* any number of any characters
([^"]+) save some characters that are not "
\1 backreference to saved characters
p print just the lines we changed
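To show how the pieces above fit together with the variable assignment asked for in the question, here is a minimal sketch. It assumes GNU sed (for -r) and uses a throwaway file from mktemp as a stand-in for momcpy.html. The name sample is only illustrative.

```shell
# Throwaway stand-in for momcpy.html (an assumption for this sketch).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr><br>
<th height="12" bgcolor="#808080"><label for="<br>
LSCRM:Abhijeet<br>
<br>
MCRM:Bhargav<br>
<br>
TLGAPI:GAURAVAURAV<br>
<br>
MOM:MANIKA"></td><br>
EOF

# On lines containing MOM:, keep only the text between MOM: and the next ".
y=$(sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' "$sample")
echo "$y"    # prints MANIKA

rm -f "$sample"
```

If more than one line contains MOM:, $y will hold one result per line, separated by newlines.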
answered Sep 3 '17 at 19:41
Zanna
That's really appreciated, but I just want the string Manika; here I'm getting mANIKAnmANIKA</td><br>
– Abhijeet Anand
Sep 3 '17 at 21:10
working fine, ty :)
– Abhijeet Anand
Sep 3 '17 at 21:20
+1 and I agree about HTML caution. However this answer is great for other applications.
– WinEunuuchs2Unix
Nov 18 at 17:45
grep -Po 'MOM:\K[^"]+' file.html
Warning: this is not a very robust solution, and your HTML is not valid.
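Since the command is not robust, it may help to check whether anything matched before using the result. The following is a minimal sketch, assuming GNU grep built with PCRE support (needed for -P); the sample file and the variable names are made up for illustration. grep exits with a non-zero status when nothing matches, and an assignment from a command substitution passes that status through.

```shell
# Throwaway input file (an assumption for this sketch).
sample=$(mktemp)
printf '%s\n' '<label for="<br>' 'MOM:MANIKA"></td><br>' > "$sample"

# The assignment succeeds only if grep found at least one match.
if y=$(grep -Po 'MOM:\K[^"]+' "$sample"); then
    echo "found: $y"
else
    echo "no MOM: entry found" >&2
fi

# Input without MOM: leaves the variable empty and takes the failure branch.
missing=$(printf '%s\n' 'LSCRM:Abhijeet<br>' | grep -Po 'MOM:\K[^"]+') \
    || echo "no MOM: entry in second input" >&2

rm -f "$sample"
```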
edited Dec 12 at 16:18
answered Sep 5 '17 at 16:12
JJoao
The string you're looking for always has MOM: before it, but you have not said if it always has " after it. For the purpose of this answer I will assume that you are looking for strings that are permitted to contain any lower or upper case alphabetic characters, numerals, or underscores. These are known as word characters in the terminology of regular expressions. Matching such "words" of text is useful enough that most dialects of regular expressions have features to help do so. If this isn't what you want, you can modify this solution accordingly or you can use the techniques in the other answers.
I echo David Foerster's, Zanna's, and JJoao's wise warnings about parsing HTML with regex and about this not being robust. Please be careful, and consider if what you have requested is really exactly what you want to do. In your example code you assigned the path to the input file to the variable $file, so I will assume this has been done. You've assigned the output of your command to $y, so I will do the same.
With grep
This is similar to JJoao's method, and you can use that method with command substitution as well if the regular expression there is more suited to your needs.
y="$(grep -oPm1 'MOM:\K\w+' "$file")"
-oPm1 is just a more compact way to write -o -P -m 1.
-o prints only the matches, not the whole line.
-P uses PCRE, which supports \K to drop text matched so far so it's not included in the matched text that is returned.
-m 1 stops reading after the first line that contains a match. This way, you assign only the match or matches from that line to the variable, rather than the matches from every matching line separated by newlines.
Note that you can also add -m1 to the command in JJoao's answer so it uses only matches from the first line that has any.
If the first line with a match contains multiple matches, this grep method gives you all of them. For example, if that line is MOM:MANIKA MOM:JANE"></td><br> then $y will hold the value:
MANIKA
JANE
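If only the first match on that line is wanted, one option is to pipe the output through head -n 1. This is a minimal sketch, again assuming GNU grep with PCRE support; the variable names line, all, and first are only illustrative.

```shell
# The example line from above, fed to grep on standard input.
line='MOM:MANIKA MOM:JANE"></td><br>'

# Every match on the first matching line, one per output line.
all=$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+')

# Keep only the first of those matches.
first=$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+' | head -n 1)

echo "$first"    # prints MANIKA
```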
With sed
This resembles Zanna's method.
y="$(sed -rn '0,/.*MOM:(\w+).*/ s//\1/p' "$file")"
Besides being enclosed as a command substitution, the differences are that I:
- stop after the first line that contains a match
- match one or more word characters (\w+) instead of characters up to a " ([^"]+)
- consume zero or more arbitrary characters (.*) first, so that MOM: doesn't have to appear at the very beginning of the line
- use a more compact syntax that avoids writing the pattern twice.
The technique I used for this requires GNU sed, but that's the sed implementation provided in Ubuntu.
If the first line with a match contains multiple matches, this sed method gives you just the last one. From MOM:MANIKA MOM:JANE"></td><br> you get:
JANE
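The last match wins because the leading .* is greedy. If the first match on the line is preferred, one possible variation is sketched below. It assumes GNU sed and is an alternative formulation rather than the method described above: it marks the first match with a newline, deletes everything before the marker, then prints and quits.

```shell
line='x MOM:MANIKA MOM:JANE"></td><br>'

# Greedy form from above: yields the last MOM: value on the line.
last=$(printf '%s\n' "$line" | sed -rn '0,/.*MOM:(\w+).*/ s//\1/p')

# Variation: replace the first "MOM:word..." with a newline plus the word,
# strip everything up to that newline, print, and stop at this line.
first=$(printf '%s\n' "$line" |
    sed -rn '/MOM:/ { s/MOM:(\w+).*/\n\1/; s/.*\n//; p; q; }')

echo "$last"     # prints JANE
echo "$first"    # prints MANIKA
```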
answered Sep 19 '17 at 23:25
Eliah Kagan