How can I extract text from a file using the terminal?

I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :(

There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

id_ad=1929170&action

and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers.

So intuitively I know I just want to ignore everything up to (and including) id_ad= and ignore everything after (and including) &action and I'll be left with the integer I want. And I know I can use regular expressions to achieve this. But I can't seem to figure it out.

I'd like to do this as a one liner from terminal if possible.

4 Answers

Not so much a one liner (although the command to run it is a one liner :) ), but here is a python option:

#!/usr/bin/env python3
import sys

file = sys.argv[1]
with open(file) as src:
    text = src.read()
# Pair the index just past each "id_ad=" with the index of the next "&action"
starters = [(i+6, text[i:].find("&action")+i) for i in range(len(text)) if text[i:i+6] == "id_ad="]
# Print everything between each pair of markers
for item in starters:
    print(text[item[0]:item[1]])

The script first lists all occurrences (indexes) of the (start) string "id_ad=", in combination with (end) string "&action". Then it prints all that is between those "markers".
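The same marker-pairing idea can also be sketched with str.find in a loop, which avoids testing every index of the text. This is only an illustration: the function name extract_between and the sample string are made up for the example and are not part of the script above.

```python
def extract_between(text, start="id_ad=", end="&action"):
    """Return every substring found between a start marker and the next end marker."""
    results = []
    pos = text.find(start)
    while pos != -1:
        begin = pos + len(start)          # first character after the start marker
        stop = text.find(end, begin)      # next end marker after that point
        if stop == -1:                    # no closing marker left, stop searching
            break
        results.append(text[begin:stop])
        pos = text.find(start, stop)      # continue after the end marker
    return results


# Hypothetical sample text with unrelated integers around the markers
sample = "junk 12 id_ad=1929170&action more 7 junk id_ad=1889170&action tail"
print(extract_between(sample))  # ['1929170', '1889170']
```

A start marker with no matching end marker after it is simply ignored here, which is an assumption about the desired behaviour.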

Extracted from a prepared file:

" I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action"

The result is:

1929170
1889170
1889170
1929990

How to use

Paste the script into an empty file, save it as extract.py, and run it with the command:

python3 <script> <file>

Note

If there is only one occurrence in the text file, the script can be much shorter:

#!/usr/bin/env python3
import sys
file = sys.argv[1]
with open(file) as src: text = src.read()
print(text[text.find("id_ad=")+6:text.find("&action")])

For example:

 egrep "id_ad=[[:digit:]]+&action" file.txt | tr "=&" " " | cut -d " " -f2

...but I am sure there are more elegant ways ;-).

Step by step:

egrep "id_ad=[[:digit:]]+&action" file.txt

Scans file.txt for lines matching the pattern (regular expression) composed of a literal id_ad=, followed by one or more digits (the meaning of [[:digit:]]+), followed by a literal &action. Matching lines are sent to standard output.

tr "=&" " "

Replaces each "=" and "&" with a space.

cut -d " " -f2

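Prints the second space-separated field of standard input. This works when id_ad= starts the line; any earlier space, "=" or "&" on the line shifts the number to a later field.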
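To see what each stage of that pipeline does to a line, here is a rough Python imitation of the three steps. It is a sketch only: the helper name second_field and the sample lines are invented for illustration, and [0-9]+ stands in for [[:digit:]]+.

```python
import re


# Hypothetical helper mirroring: egrep ... | tr "=&" " " | cut -d " " -f2
def second_field(lines):
    pattern = re.compile(r"id_ad=[0-9]+&action")
    table = str.maketrans("=&", "  ")  # both characters become a space
    fields = []
    for line in lines:
        if pattern.search(line):  # egrep: keep only matching lines
            # tr: translate the characters; cut: split on single spaces, take field 2
            fields.append(line.translate(table).split(" ")[1])
    return fields


print(second_field(["id_ad=1929170&action=view", "no match 42"]))  # ['1929170']
print(second_field(["see id_ad=1929170&action"]))                  # ['id_ad']
```

The second call shows the field shifting when something precedes id_ad= on the line, so the field number only holds for lines that begin with the marker.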

With sed:

sed 's/id_ad=\(.*\)&action/\1/' filename

Explanation:

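The command above replaces the text from the START word (id_ad=) through the END word (&action) with whatever (.*) lies between them. Text before and after the match stays on the line, and lines without a match are printed unchanged; run sed with -n and add the p flag to the substitution to print only the lines that changed.
\(...\) defines a capturing group: \( starts the group and \) ends it. \1 refers back to the text captured by the first group (there is only one here).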

A stricter sed command for the above solution:

sed 's/^id_ad=\([0-9]*\)&action/\1/' filename

^: Start of the line.
[0-9]*: Zero or more digits.
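For readers more at home in Python, the capture-group substitution can be tried out with re.sub, which uses the same \1 back-reference idea. This is a sketch for experimenting, not a replacement for the sed commands: the function names and the sample line are made up, and the second pattern (with .* on both sides) is a variation that is not in the answer above.

```python
import re


def sed_like(line):
    # Same substitution as the first sed command: only the matched part is
    # replaced, so text around it stays in the line.
    return re.sub(r"id_ad=(.*)&action", r"\1", line)


def digits_only(line):
    # Variation (an assumption, not from the answer): matching the whole line
    # with a leading and trailing .* leaves only the captured digits.
    return re.sub(r".*id_ad=([0-9]+)&action.*", r"\1", line)


line = "prefix id_ad=1929170&action suffix"  # hypothetical sample line
print(sed_like(line))     # prefix 1929170 suffix
print(digits_only(line))  # 1929170
```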

With grep:

grep -Po '(?<=id_ad=)[0-9]*(?=&action)' filename

Explanation:

From man grep:

-o, --only-matching Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-P, --perl-regexp Interpret PATTERN as a Perl compatible regular expression (PCRE)

Returns any run of digits ([0-9]*) between the START word (id_ad=) and the END word (&action) in filename. Because of -o, empty matches are not printed.

(?<=pattern): Positive lookbehind. A pair of parentheses, with the opening parenthesis followed by a question mark, a "less than" symbol, and an equals sign.

(?<=id_ad=)[0-9]*: matches zero or more digits that come directly after id_ad=, without making id_ad= part of the match.

(?=pattern): Positive lookahead. A pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

[0-9]*(?=&action): matches zero or more digits that are directly followed by &action, without making &action part of the match.
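The lookbehind and lookahead syntax is the same in Python's re module, so the pattern can be tested interactively before using it with grep. A small sketch, assuming a made-up sample string and a hypothetical function name find_ids; it uses [0-9]+ rather than [0-9]* so that empty matches never appear in the result.

```python
import re


def find_ids(text):
    # (?<=id_ad=) and (?=&action) only assert their surroundings;
    # the returned matches contain the digits alone.
    return re.findall(r"(?<=id_ad=)[0-9]+(?=&action)", text)


# Hypothetical sample with unrelated integers mixed in
text = "x=12 id_ad=1929170&action=go y=7 id_ad=1889170&action"
print(find_ids(text))  # ['1929170', '1889170']
```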

Another Python answer, using the re module. Example stolen from Jacob's post.

script.py

#!/usr/bin/python3
import sys
import re
file = sys.argv[1]
L = []  # Declare an empty list
with open(file) as src:
    for j in src:  # Iterate through all the lines
        for i in re.findall(r'id_ad=(\d+)&action', j):  # Extract the digits between `id_ad=` and `&action`
            L.append(i)  # Append the extracted digits to the list L
for f in L:  # Iterate through all the elements in the list L
    print(f)  # Print each element on its own line

Run the above script as,

python3 script.py /path/to/the/file

Example:

$ cat fi
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action

$ python3 script.py ~/file
1929170
1889170
1889170
1929990
