How do I extract all the external links of a web page and save them to a file?
If there is any command line tools that would be great.
There was quite a similar question here, and the answer worked gracefully for google.com, but for some reason it doesn't work with e.g. YouTube. I'll explain: let's take this page as an example. If I try to run
lynx -dump | awk '/http/{print $2}' | grep watch > links.txt
then, unlike with google.com, it first executes lynx's dump, then hands control to awk (for some reason with empty input), and finally writes nothing to the file links.txt. Only after that does it display the unfiltered lynx dump, with no way to redirect it elsewhere.
Thank you in advance!
13 Answers
lynx -dump '<url>' | awk '/http/{print $2}' | grep watch > links.txt
works (with the URL wrapped in single quotes, shown here as <url>). You need to escape the & in the link.
In your original line, the unescaped & sends lynx to the background, leaving the rest of the pipeline with empty input, so nothing reaches links.txt. The backgrounded process still writes its output to the terminal you are in, but, as you noticed, it does not honor the > redirect (ambiguity: which process should write to the file?).
Addendum: I'm assuming the ' at the beginning and end of your original command is a typo; it should not be present. Otherwise you would get different error messages about trying to execute a non-existent command. Removing those quotes gives exactly the behavior you describe.
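The backgrounding behavior is easy to reproduce without lynx or the network. A minimal sketch (the URL is made up): an unquoted & splits the command line and backgrounds the part before it, while quoting or backslash-escaping keeps the & literal:

```shell
# An unquoted '&' terminates the command before it and runs that part in
# the background; everything after the '&' is parsed as a SEPARATE command.
# Single quotes keep the '&' as an ordinary character in one argument:
url='https://www.youtube.com/watch?v=abc&list=xyz'   # hypothetical URL
printf '%s\n' "$url"        # the whole string survives, '&' included

# Backslash-escaping each '&' individually works too:
printf '%s\n' https://www.youtube.com/watch?v=abc\&list=xyz
```

Either form hands lynx the complete URL as a single argument, so the whole pipeline stays in the foreground and the > redirect applies as expected.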
Use your favorite search engine and search for 'website scraper script' or 'website scraping script' plus whatever programming language you are most comfortable with. You have thousands and thousands of options, so do the most detailed search you can.
While there are lots of options to choose from, I would recommend Python with BeautifulSoup - this gives you total control of the process, including following redirects, handling self-signed/expired SSL certificates, working around invalid HTML, extracting links only from specific page blocks, etc.
For an example check out this thread:
Installing BeautifulSoup is as trivial as running pip install BeautifulSoup or easy_install BeautifulSoup if you are on Linux. On Windows it is probably easiest to use a binary installer.
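To give a feel for the shape of such a script, here is a minimal stdlib-only sketch of the link-extraction step (the sample HTML is made up; BeautifulSoup does the same thing but copes far better with broken real-world markup):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page fragment; in practice you would feed it the
# downloaded HTML of the page you are scraping.
html = '<a href="https://example.com/watch?v=1">one</a> <a href="/local">two</a>'
parser = LinkCollector()
parser.feed(html)

# Keep only absolute (external) links, then write them out.
external = [u for u in parser.links if u.startswith("http")]
print(external)  # -> ['https://example.com/watch?v=1']
```

With BeautifulSoup the class above collapses to roughly soup.find_all("a"), and you also get tolerant parsing of invalid HTML for free.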