I understand that I can use wget -i text_file.txt to download images from URLs listed in a .txt file, but my file also has some extra indexing. I'm trying to download a dataset for machine learning, and it has different classes of images.
It has something like
2598 98
2599 99
2600 00
2601 01
2602 02
2603 03
and later...
6577 77
6578 78
6579 79
6580 80
6581 81
6582 82
6583 83
6584 84
Now I would like to use the indexing and download images with the same indexes to the same file... or something like that.
Thanks!
3 Answers
That looks like a simple job for cut(1):
cut -d ' ' -f 3 < url-listing.txt

You can pipe its output directly to wget and use the "special" file name - to read from standard input:

cut -d ' ' -f 3 < url-listing.txt | wget -i -

You can use sed to remove the numbers at the start of each line:
sed -r 's/^[0-9]+//g' urls.txt > urls_without_numbers.txt

Now you can use wget with the new urls_without_numbers.txt.
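As a quick illustration (the sample lines and the file name demo.txt are made up; I've also added ' *' to the pattern, which the answer above doesn't have, so the separating space is stripped along with the index):

```shell
# Sample file in the assumed "index url" format.
printf '%s\n' \
  '2598 http://example.com/img98.jpg' \
  '2599 http://example.com/img99.jpg' > demo.txt

# Strip the leading index; the ' *' also eats the separating space.
sed -r 's/^[0-9]+ *//' demo.txt
```

With the original pattern 's/^[0-9]+//g' each URL keeps a leading space, so stripping the separator too is the safer choice.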
If for some reason you really need to do it without creating a new file as above:
sed -r 's/^[0-9]+//g' urls.txt | wget -i /dev/stdin

In the event that your lines contain number ranges like 1-100, try this:

sed -r 's/^[0-9\-]+//g' urls.txt > urls_without_numbers.txt

It seems easier to me to solve this with awk. Awk splits each line on a separator and then executes a command. With
for url in $(awk '{print $NF}' url1.txt | tr -d '\r'); do wget -L $url -O - | grep "preview-image"; done 2>&1 | grep "img src" | awk '{print $5}' | tr -d "\"" | awk -F'=' '{print $2}' &> real_urls.txt

you first print the last element of each line, with the line split on spaces (the default). Then you remove '\r' (which should not be in the URL) and use each URL as an argument for wget. In the wget output, the relevant img tag is then found by grep. Afterwards, you need to extract what comes after src. This is done by deleting the " characters (which need to be escaped) and then using awk to get what comes after =. Everything is saved into real_urls.txt. Then you can download with:
for url in $(cat real_urls.txt); do wget "$url"; done
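To come back to the asker's goal of grouping downloads by their index: a minimal sketch, assuming each line of the listing has the form "index class url" (the file name list.txt, the column layout, and the .jpg extension are all assumptions, not something given in the question):

```shell
# Download each image into a per-class directory,
# e.g. the line "2598 98 http://example.com/x.jpg" -> images/98/2598.jpg
while read -r index class url; do
  [ -n "$url" ] || continue        # skip malformed lines
  mkdir -p "images/$class"
  wget -q -O "images/$class/$index.jpg" "$url"
done < list.txt
```

Naming each file after its index also avoids collisions when two URLs end in the same basename.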