
So I get that I can use wget -i text_file.txt to download images from URLs listed in a .txt file, but my file also has some weird indexing. I'm trying to download a dataset for machine learning, and it has different classes of images.

It has something like

2598 98
2599 99
2600 00
2601 01
2602 02
2603 03 

and later...

6577 77
6578 78
6579 79
6580 80
6581 81
6582 82
6583 83
6584 84 

Now I would like to use the indexing and download images with the same indexes to files with the same names... or something like that.

Thanks!


3 Answers

That looks like a simple job for cut(1):

cut -d ' ' -f 3 < url-listing.txt

You can pipe its output directly to wget and use the “special” file name - to read from standard input:

cut -d ' ' -f 3 < url-listing.txt | wget -i -
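If you also want to keep the leading index as the output filename, a plain read loop can do the splitting instead of cut. A minimal sketch (the example.com URLs, the .jpg extension, and the "index class url" field layout are assumptions; adjust them to your actual file):

```shell
# Sample data inline so the sketch is self-contained; in practice,
# point the loop at your real listing file instead.
cat > url-listing.txt <<'EOF'
2598 98 http://example.com/img2598.jpg
2599 99 http://example.com/img2599.jpg
EOF

# read splits each line on whitespace: first field -> idx,
# second -> cls, rest -> url. The echo shows the command that
# would run; remove it to actually download.
while read -r idx cls url; do
    echo wget -O "${idx}.jpg" "$url"
done < url-listing.txt
```

wget -O writes the download to the given filename, so each image lands in a file named after its index.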

You can use sed to remove the numbers at the start of each line:

sed -r 's/^[0-9]+//g' urls.txt > urls_without_numbers.txt

Now you can use wget with the new urls_without_numbers.txt.

If for some reason you really need to do it without creating a new file as above:

sed -r 's/^[0-9]+//g' urls.txt | wget -i /dev/stdin

In the event that your lines contain number ranges like 1-100, try this:

sed -r 's/^[0-9\-]+//g' urls.txt > urls_without_numbers.txt
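A quick way to check the substitution on sample lines before pointing wget at the result (the example.com URLs are placeholders; the extra ' *' in the pattern also trims the separating space, which the pattern above would otherwise leave in front of each URL, and -E is the portable spelling of -r):

```shell
# Strip the leading digits plus the following space from each line.
printf '2598 http://example.com/a.jpg\n2599 http://example.com/b.jpg\n' \
  | sed -E 's/^[0-9]+ *//'
```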

It seems easier to me to solve this with awk. Awk splits each line on a delimiter (whitespace by default) and then runs a command on the resulting fields. With

for url in $(awk '{print $NF}' url1.txt | tr -d '\r'); do
    wget -L "$url" -O - | grep "preview-image"
done 2>&1 | grep "img src" | awk '{print $5}' | tr -d '"' \
    | awk -F'=' '{print $2}' &> real_urls.txt

you first print the last field of each line (split on whitespace by default). Then you remove any '\r' (which should not be in a URL) and use the URL as an argument for wget. grep then searches wget's output for the right img tag. Afterwards you need to extract what comes after src: delete the double quotes (which need to be escaped, or single-quoted) and use awk to print what follows the =. Everything is saved into real_urls.txt. Then you can just download with:

for url in $(cat real_urls.txt); do wget "$url"; done
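The field-extraction step of the pipeline can be checked on its own with inline sample data (the example.com URL is a placeholder):

```shell
# Print the last whitespace-separated field of each line and drop any
# stray carriage return left over from Windows line endings.
printf '2598 98 http://example.com/a.jpg\r\n' \
  | awk '{print $NF}' | tr -d '\r'
```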
