#1
Hi!
"browsing the web from the command line" is an old idea I had and didn't have the time to develop, so I publish it here and see if anyone's interested in it... The idea is not just to access the Web from a shell in text-mode (text-mode browsers exist yet ![]() $ lynx -source http://ret.mine.nu/top.swf | strings | grep http As another example, give a look at http://surfraw.sourceforge.net, which gives you the chance to redirect a web result on any command line app. Now you'll probably ask yourselves: "why should we spend time searching for these tricks while we're able to access the Web anyway, with our browsers?". Well, because in this way YOU decide what to see and what to do with the data you download, you choose what to download, you won't have any more popups and banners, and I think just this might be enough ![]()
__________________
byez, +mala
#2
I've moved the thread here, since it might become a little more code-oriented... In the meanwhile, I've tried some experiments with wget and found something which might come in handy in some situations. As an example, to attract your attention, I've run my experiments on porn websites ^__^

Just take one of those "free pr0n" pages, which have new links every day pointing to free sections of other websites. I ran this single line:

wget -A mpg,mpeg,avi,asf -r -H -l 2 -nd -t 1 http://url.you.like

went to the cinema, and when I was back... 160MB of movies, without having to follow links, click on images, close popups, read banners and so on.

How does it work? Here's a description of the switches I've used:

-A mpg,mpeg,avi,asf   makes wget save only files that end with the extensions I've specified
-r                    recursive: follows the links it finds in the homepage
-H                    span hosts: when recursive, it follows the links to foreign hosts too
-l 2                  (note: lowercase 'L') recursion depth: I've set it to 2, to follow the links in the main page and then the links to the video files in the linked pages
-nd                   no directories: doesn't create directories, puts everything in the same dir
-t 1                  retries: I've set it to 1, to avoid losing time retrying when a server can't be found

Hope you found it interesting.
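By the way, if a site throttles or bans aggressive clients, a few extra switches may help. This is only a hedged sketch (the options below are standard GNU wget options, but check your version; the URL is still a placeholder):

wget -A mpg,mpeg,avi,asf -r -H -l 2 -nd -t 1 -w 2 --random-wait --limit-rate=200k http://url.you.like

-w 2 waits a couple of seconds between requests, --random-wait varies that pause, and --limit-rate keeps the bandwidth usage down, so the crawl looks a little less like a vacuum cleaner.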
__________________
byez, +mala
#3
Hello...
Here is a way one could get the (nick)name of the last person who posted a reply to a favorite thread on this forum:

$ lynx -nolist -dump 'http://ret.mine.nu/board/viewforum.php?f=3' |
> grep -2 "Browsing the web" | tail -1 | awk '{ print $1 }'

- GUI is designed for USERS. All the others use cmdline ...

null
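A hedged sketch of how one could keep an eye on the thread with the same pipeline: wrap it in a loop and print the last poster every now and then (the 600-second interval is an arbitrary choice, and the grep pattern is just the thread title used above):

while true; do
    lynx -nolist -dump 'http://ret.mine.nu/board/viewforum.php?f=3' |
        grep -2 "Browsing the web" | tail -1 | awk '{ print $1 }'
    sleep 600
done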
#4
Heh, that's great!
In the meanwhile, I've made a little perl script to ease link parsing. That is, a script which allows you to:

- extract all the links from one page
- print only the ones that either
  - follow a specified pattern in the URL, or
  - follow a specified pattern in the text "tagged" by the anchor

I've tried to cut'n'paste it from the preview and it works. You can use it this way:

perl filename.pl http://url "URL regexp" "text regexp"

For instance,

perl exturl.pl http://ret.mine.nu/links.html "" "descendants"

gives you the link to the Immortal Descendants mirror (this version still doesn't convert relative URLs to absolute ones, as you might see). Instead,

perl exturl.pl http://ret.mine.nu/links.html "cjb"

gives you only links to the websites at cjb.net.

perl exturl.pl http://ret.didjitalyphrozen.com/board/sear...rch_author=mala "viewforum"

returns the URLs of all the forums of this website where I've written a message.

perl exturl.pl http://ret.didjitalyphrozen.com/board/sear...rch_author=mala "" "stegano|command line"

returns the URLs of the forums or messages whose subject contains "stegano" or "command line".

-----------------------------------------------------------------------------

#!/usr/bin/perl
use LWP::UserAgent;
use Data::Dumper;

# this is a very simplified implementation of getpage, but it should work
# with no problems if you don't have to authenticate yourself
sub getpage{
    my $url = shift;
    my $ua = new LWP::UserAgent;
    # note: for some websites we _have_ to provide an agent name
    $ua->agent('Two/0.1');
    # connect to the main url
    my $req = new HTTP::Request GET => $url;
    my $res = $ua->request($req);
    die "Error connecting to the main page:\n".Dumper($res->headers)
        unless $res->is_success;
    return $res->content;
}

sub xtracturl{
    my ($content,$regexp1,$regexp2) = @_;
    my (@links,@links2);
    my %hash;
    # powerful regexp! Hope that works
    # grab the href attribute and the anchor text of every <a> tag
    while ($content =~ /<\s*a.*?href\s*=[\s]*"?([^\s">]+).*?>(.*?)<\/a>/gsi){
        my $url = $1;
        my $str = $2;
        # keep the link if the URL matches the first pattern...
        if ($url =~ /$regexp1/i){
            push (@links, $url);
        }
        # ...or if the anchor text matches the second one
        if ($str =~ /$regexp2/i){
            push (@links, $url);
        }
    }
    # clean links array from dupes
    for (@links){
        $hash{$_}++ || push @links2,$_;
    }
    return @links2;
}

print join ("\n", xtracturl(getpage($ARGV[0]), $ARGV[1], $ARGV[2]))."\n";
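Since exturl.pl prints one URL per line, its output can be fed straight into other tools. A hedged example, mirroring the two-argument form used above (the URL and the extension pattern are only placeholders, and it obviously only works while the page uses absolute links, since the script doesn't make relative URLs absolute yet):

$ perl exturl.pl http://url.you.like '\.mpg$|\.avi$' | xargs -n 1 wget -t 1

xargs runs wget once per URL, so you get the same "fire and forget" effect as the wget one-liner in my second message, but with full control over which links are followed.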
__________________
byez, +mala
#5
Looks like a very nice script, mala, especially the regexp!
I have two things to point out:

1. You can safely change the <\s*a part of the regexp to <a, because no whitespace is allowed between the tag opening sign "<" and the tag name.

2. You know me... I (almost) couldn't post anything here without pasting a nice example from my console:

lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http'

Note that by default lynx displays all the links found inside a web page when you use the -dump switch. These links are displayed at the end of the output. If you don't want lynx to display these references, you have to use the -nolist option too.

null
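A small hedged variation on the same pipeline: if you only care about which sites are linked rather than the full URLs, you can cut each URL down to its host part (this assumes the usual cut and sort tools):

lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http' | cut -d/ -f3 | sort -u

cut -d/ -f3 takes the third slash-separated field of http://host/path, i.e. the host, and sort -u removes the duplicates.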
#6
As usual it's very nice! And please, let me comment it to

1) see if I understood it well
2) explain it to others, so we will be able to build a little "regexp tute", step by step

Quote:

lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http'
1) Connect with lynx to http://ret.mine.nu/links.html and output the dump of the page, that is NOT the source but the rendered page, with a list of links at its end (this last detail is important).

2) Then the output of lynx is piped to sed, which takes it as input and processes it with a substitution of "something" (we'll see it later).

3) Finally, the output of sed is piped to grep, which returns only the lines which begin with http (that '^' before http generally means "the string begins here": in this case we are working with every line of the page processed by sed).

What does sed do?

's/^ *[0-9]*\. [^h]*//'

Good, we can see an s///, which means substitution: the syntax is s/string to substitute/new string/ and, since we have s/something//, we can understand that we actually want to DELETE something which satisfies the regexp in the first section.

^ *[0-9]*\. [^h]* means:

^        the line begins here
_*       (that is, <space>*) 0 or more spaces (* means 0 or more)
[0-9]*   0 or more digits (well, maybe I'd put a "+" here, since we should always have at least one)
\._      a dot followed by a space (the dot has to be escaped with '\' because it has another meaning -that is, 'any char'- otherwise)
[^h]*    anything which is not an h, 0 or more times (^ at the beginning of the square brackets means 'not')

Really nice. I'm going to work on the recursive exturl perl sub (I already have a working version, but not a looping one, and I'm working on a looping one which allows users to specify depth - what do I mean by looping? Well, I'll explain it better in my next message).
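To see the substitution at work on a single line, here is a quick hedged illustration (the numbered line is just the kind of reference line lynx typically prints at the end of its dump):

$ echo '   3. http://ret.mine.nu/links.html' | sed 's/^ *[0-9]*\. [^h]*//'
http://ret.mine.nu/links.html

The leading spaces, the number, the dot and the space are eaten, and since the next character is already an 'h', [^h]* matches nothing and the URL survives untouched.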
__________________
byez, +mala
#7
Oops!
Thanks, mala, for doing my job. It was actually my responsibility to explain the pipeline, but you have done it. Thanks again, it's a really good analysis - looks like you have something to do with reverse engineering, can that be!?

- Greetz -

null
#8
mala: i noticed in one of your posts a - .*? - in a regexp. i'm new to regexps and i understood

- . - means "any char"
- * - means 0 or more times

then, i know - ? - means 0 or 1 'times' (or if you will, 'possibly' this char...) but i still don't understand what - .*? - does? (still, i did use it in code - along your lines - and it helped me a lot....) could you help?

ps. anyway i found the book "mastering regular expressions in perl" very helpful
#9
Quote:

but i still don't understand what - .*? - does?
That question mark, used after the asterisk, makes the match NON-greedy. Usually, in a regular expression, when you write .* you mean "match everything", and this is quite a powerful command: by default the match is greedy, so everything up to the LAST occurrence of the text which follows the .* will match. As an example: <a href=".*"> will match both <a href="page.htm"> and the whole of <a href="page.htm" target="wherever">, which is not what you want. If, instead, you write <a href=".*?">, it will match everything between double quotes, stopping at the FIRST double quote it finds (see the little demonstration at the end of this message).

Quote:

ps. anyway i found the book "mastering regular expressions in perl" very helpful
http://www.trig.org/cs/
http://books.pdox.net/Computers/

(quite easy to find, just put the title of the book inside google)
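And here is that quick demonstration from the shell, using a perl one-liner on a made-up tag (a hedged illustration, nothing more):

$ echo '<a href="page.htm" target="_top">' | perl -ne 'print "$1\n" if /href="(.*)"/'
page.htm" target="_top
$ echo '<a href="page.htm" target="_top">' | perl -ne 'print "$1\n" if /href="(.*?)"/'
page.htm

The greedy (.*) runs on until the last double quote of the line, while the non-greedy (.*?) stops at the first one.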
__________________
byez, +mala
#10
I would like to add something about the "traditional" way of getting non-greedy matching. I'll explain it with a simple example which kills the tags in an HTML file. You could do something like:

$ cat index.htm | sed 's/<[^>]*>//g'

... and it would remove all the tags from the input file (here: index.htm) under certain circumstances (to be explained by mala !!!).

The regexp here is <[^>]*> which means:

1. match the '<'
2. match everything which is not a '>' zero or more times - [^>]*
3. match the '>'

This way the matching stops at the first occurrence of '>', in contrast to <.*> which "eats" all the chars between the first '<' and the last '>' on a line.

null
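For example, a hedged illustration on a one-line input:

$ echo '<b>bold</b> and <i>italic</i>' | sed 's/<[^>]*>//g'
bold and italic
$ echo '<b>bold</b> and <i>italic</i>' | sed 's/<.*>//g'

The second command prints an empty line, because the greedy <.*> swallows everything from the first '<' to the last '>', text included.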
$ cat index.htm | sed 's/<[^>]*>//g' ... and it would remove all the tags from the input file (here: index.htm) under certain circumstances (to be explained by mala !!!). The regexp is here <[^>]*> which means: 1. match the '<' 2. match everything which is not a '>' zero or more times - [^>]* 3. match the '>' This way the matching stops at the first occurence of '>' in contrary to <.*> which "eats" all the chars between the first '<' and the last '>' on a line. null |