Extracting a specific string after a given string from HTML file using a bash script











I have an HTML file, momcpy.html, from which I want to extract a specific string that follows a given string.
The file content looks like this:



<tr><br>
<th height="12" bgcolor="#808080"><label for="<br>
LSCRM:Abhijeet<br>
<br>
MCRM:Bhargav<br>
<br>
TLGAPI:GAURAVAURAV<br>
<br>
MOM:MANIKA"></td><br>


This is present on one of the lines of HTML.



I want to extract MANIKA and store it in a variable. Basically, I want to extract whatever string follows MOM: (it could be dynamic).



I have tried:



file='/home/websphe/tomcat/webapps/MOM/web/momcpy.html'
y=$( awk '$1=="MOM:"{print $2}' $file)
echo "$y"


But that didn't work.







command-line bash text-processing






edited Sep 3 '17 at 21:13

























asked Sep 3 '17 at 19:00









Abhijeet Anand













  • Here I want to extract the string after "MOM:": <tr><br> <th height="12" bgcolor="#808080"><label for="<br> LSCRM:Abhijeet<br> <br> MCRM:Bhargav<br> <br> TLGAPI:GAURAVAURAV<br> <br> MOM:MANIKA">Agenda:</label></th><br>
    – Abhijeet Anand
    Sep 3 '17 at 19:33












  • edited question
    – Abhijeet Anand
    Sep 3 '17 at 19:38






  • This document is not well-formed HTML and thus not actually HTML. Element attributes may not contain unescaped < or > and the for attribute of the label element may only contain IDs of other (form) elements. It would also help if you included the full element tree from the root down to the element in question so that one may build a solution based on an actual (X)HTML parser.
    – David Foerster
    Sep 3 '17 at 21:27




















3 Answers
      accepted






      I can't sensibly advise doing this, because parsing HTML with regex is not likely to end well, but you might be able to get the string MANIKA with



      sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' file


      It works OK on your sample anyway...



      Notes





      • -n don't print anything until we ask for it


      • -r use ERE


      • /string/ find lines with string


      • s/old/new/ replace old with new


      • .* any number of any characters


      • ([^"]+) save some characters that are not "


      • \1 backreference to saved characters


      • p print just the lines we changed
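As a minimal sketch of how the command above could feed a variable, the following builds a throwaway sample file (the file contents and the variable name `sample` are assumptions made for illustration, modelled on the line quoted in the comments) and captures the result in `y`, as in the question. It uses `-E`, which GNU sed accepts as a synonym for `-r`.

```shell
# Hypothetical sample file modelled on the markup quoted in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<th height="12" bgcolor="#808080"><label for="<br>
TLGAPI:GAURAVAURAV<br>
MOM:MANIKA">Agenda:</label></th><br>
EOF

# On lines containing MOM:, replace the whole line with the captured
# run of non-quote characters that follows MOM:, and print only those lines.
y=$(sed -nE '/MOM:/ s/.*MOM:([^"]+).*/\1/p' "$sample")
echo "$y"

rm -f "$sample"
```

If the real file has more than one line containing MOM:, `y` would hold one extracted value per line, separated by newlines.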

















      answered Sep 3 '17 at 19:41









      Zanna













      • That's really appreciated, but I just want the string Manika; here I'm getting mANIKAnmANIKA</td><br>
        – Abhijeet Anand
        Sep 3 '17 at 21:10










      • working fine, ty :)
        – Abhijeet Anand
        Sep 3 '17 at 21:20










      • +1 and I agree about HTML caution. However this answer is great for other applications.
        – WinEunuuchs2Unix
        Nov 18 at 17:45


















          grep -Po 'MOM:\K[^"]+' file.html


          Warning: this is not a very robust solution, and your HTML is not valid.
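A small sketch of the same grep pattern used with command substitution, so the result lands in a variable as the question asks. The input line is copied from the question and fed through a pipe rather than read from a file; the variable name `line` is an assumption for illustration, and `-P` requires a grep built with PCRE support (GNU grep normally is).

```shell
# The line from the question that carries the value.
line='MOM:MANIKA"></td><br>'

# \K discards "MOM:" from the reported match; [^"]+ then takes
# everything up to (but not including) the next double quote.
y=$(printf '%s\n' "$line" | grep -Po 'MOM:\K[^"]+')
echo "$y"
```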















          edited Dec 12 at 16:18

























          answered Sep 5 '17 at 16:12









          JJoao







































                  answered Sep 19 '17 at 23:25









                  Eliah Kagan
