For a project I'm working on I need to get a list of all URLs in a certain folder of a domain, or better yet all URLs matching a regular expression.
I want to do this using bash so as to avoid installing any programs that I won't ever end up using, but if there is a solution using programs I might already have, such as Firefox, please go ahead and tell me.
Thank you for your time.
1 Answer
I figured out how to manage this in my case; much of it should be the same for anyone else, and you should be able to adapt this process to work with any URL.
- Change to a new directory

  First we should change to a new directory to avoid files getting lost or being kept after we need them.

  ```
  mkdir ~/Desktop/dev
  cd ~/Desktop/dev
  ```
- Get URLs with wget

  Next we use the wget command to find all URLs for files and folders in the domain. For me the command was:

  ```
  wget -o ./urls.txt --spider -r --reject="index.html" --no-verbose --no-parent https://example.com/folder/
  ```
  Just replace the URL at the end of the above command (shown here as a placeholder) with your own, and it should create a text file (urls.txt) full of URLs and a bunch of other nonsense.

- Remove the folder left by wget

  wget will have left behind a folder named after the domain of your input URL. There is no important information in this folder, so go ahead and remove it with the rm command or through your file manager.
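  For example, assuming your input URL was under example.com (swap in your own domain), something like this should get rid of it:

  ```
  # remove the directory tree wget created for the crawled domain
  rm -r ./example.com/
  ```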
- Build a regex to extract the actual URLs

  This is the hard part. I recommend opening urls.txt in a text editor that allows searching with regexes, and opening a regex tester like regexr in your browser. Now you have to build a regex that matches the URLs you want. Once you find one, run the command:

  ```
  grep -o -E "(https.*\/([0-9](\.[0-9])+)\/(mono\/)?Godot_v\2[-_]stable[_-](mono_)?((win)?(x11[\._])?(osx\.?)?)((32)?(64)?)?((\.exe)?(\.fat)?)\.zip)" ./urls.txt >> urls\ filtered.txt
  ```

  This will copy all lines matching the regex to a text file (urls filtered.txt). Replace the regex (the bit in the quotes) with your regex.
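  If you just need every URL ending in a given extension rather than something as specific as my case, a much simpler pattern is usually enough. A minimal sketch, assuming you want all .zip links:

  ```
  # match anything that looks like a URL ending in .zip
  grep -o -E "https?://[^\" ]+\.zip" ./urls.txt >> urls\ filtered.txt
  ```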
After all that you should be left with a text file of all the URLs that you need.
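If you actually want to download the files rather than just list them, wget can read the list back in with its -i option. A minimal sketch, assuming the filtered file from the previous step:

```
# download every URL listed in the filtered file, one per line
wget -i "urls filtered.txt"
```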