I understand that I can use wget -i text_file.txt to download images from URLs listed in a .txt file, but my file also has some extra indexing. I'm trying to download a dataset for machine learning, and it has different classes of images.
It has something like
2598 98
2599 99
2600 00
2601 01
2602 02
2603 03
and later...
6577 77
6578 78
6579 79
6580 80
6581 81
6582 82
6583 83
6584 84
Now I would like to use the indexing and download images with the same indexes to the same file... or something like that.
Thanks!
3 Answers
That looks like a simple job for cut(1):
cut -d ' ' -f 3 < url-listing.txt

You can pipe its output directly to wget and use the "special" file name - to read from standard input:

cut -d ' ' -f 3 < url-listing.txt | wget -i -

You can use sed to remove the numbers at the start of each line:
sed -r 's/^[0-9]+//g' urls.txt > urls_without_numbers.txt

Now you can use wget with the new urls_without_numbers.txt.
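As a quick illustration (the sample lines and the file name demo.txt are made up; I've also added ' *' to the pattern, which the answer above doesn't have, so the separating space is stripped along with the index):

```shell
# Sample file in the assumed "index url" format.
printf '%s\n' \
  '2598 http://example.com/img98.jpg' \
  '2599 http://example.com/img99.jpg' > demo.txt

# Strip the leading index; the ' *' also eats the separating space.
sed -r 's/^[0-9]+ *//' demo.txt
```

With the original pattern 's/^[0-9]+//g' each URL keeps a leading space, so stripping the separator too is the safer choice.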
If for some reason you really need to do it without creating a new file as above:
sed -r 's/^[0-9]+//g' urls.txt | wget -i /dev/stdin

In the event that your lines contain number ranges like 1-100, try this:

sed -r 's/^[0-9\-]+//g' urls.txt > urls_without_numbers.txt

It seems easier to me to solve this with awk. Awk splits each line on a separator and then executes a command. With
for url in $(awk '{print $NF}' url1.txt | tr -d '\r'); do wget -L $url -O - | grep "preview-image"; done 2>&1 | grep "img src" | awk '{print $5}' | tr -d "\"" | awk -F'=' '{print $2}' &> real_urls.txt

you first print the last element of each line, with the line split on spaces (the default). Then you remove '\r' (which should not be in the URL) and use each URL as an argument for wget. In the wget output, the relevant img tag is then found by grep. Afterwards, you need to extract what comes after src. This is done by deleting the " characters (which need to be escaped) and then using awk to get what comes after =. Everything is saved into real_urls.txt. Then you can download with:
for url in $(cat real_urls.txt); do wget "$url"; done
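To come back to the asker's goal of grouping downloads by their index: a minimal sketch, assuming each line of the listing has the form "index class url" (the file name list.txt, the column layout, and the .jpg extension are all assumptions, not something given in the question):

```shell
# Download each image into a per-class directory,
# e.g. the line "2598 98 http://example.com/x.jpg" -> images/98/2598.jpg
while read -r index class url; do
  [ -n "$url" ] || continue        # skip malformed lines
  mkdir -p "images/$class"
  wget -q -O "images/$class/$index.jpg" "$url"
done < list.txt
```

Naming each file after its index also avoids collisions when two URLs end in the same basename.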