#1 - 12-26-2002, 07:50 AM - mala (Administrator)
Browsing the web from the command line

Hi!

"browsing the web from the command line" is an old idea I had and didn't have the time to develop, so I publish it here and see if anyone's interested in it...

The idea is not just to access the Web from a shell in text mode (text-mode browsers already exist), but to use shell commands, or new hand-made ones, to retrieve only the information we're interested in, or to get the info we couldn't see with a normal text browser. As an example, look at this oneliner from null, which prints the absolute URLs stored inside a Flash file:

$ lynx -source http://ret.mine.nu/top.swf | strings | grep http
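
The same extraction works with anything that can write a fetched file to stdout; here's a sketch of the equivalent with wget (-q for quiet, -O- for standard output):

$ wget -qO- http://ret.mine.nu/top.swf | strings | grep http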

As another example, take a look at http://surfraw.sourceforge.net, which lets you pipe web results into any command-line app.
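
In the same spirit, here's a minimal sketch (not surfraw itself) that feeds a page to an ordinary text tool, in this case just counting the links lynx finds (-listonly limits the -dump output to the reference list):

$ lynx -dump -listonly http://ret.mine.nu/links.html | grep -c 'http'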

Now you'll probably ask yourselves: "why should we spend time searching for these tricks when we can already access the Web with our browsers?". Well, because this way YOU decide what to see and what to do with the data you download, you choose what to download, and you won't have popups and banners any more. I think that alone might be enough.
byez,

+mala
#2 - 12-26-2002, 07:00 PM - mala (Administrator)
well...

I've moved the thread here, since it might become a little more code-oriented... in the meanwhile, I've tried some experiments with wget and found something which might come in handy in some situations. As an example, to attract your attention, I've run my experiments on porn websites ^__^

Just take one of those "free pr0n" pages, which have new links every day pointing to the free sections of other websites. I just ran this single line

wget -A mpg,mpeg,avi,asf -r -H -l 2 -nd -t 1 http://url.you.like

went to the cinema and when I was back... 160MB of movies, without having to follow links, click on images, close popups, read banners and so on.

How does it work? Here's a description of the switches I've used (a combined example follows the list):

-A mpg,mpeg,avi,asf
This one makes wget save only files that end with the extensions I've specified

-r
Recursive: follows the links it finds in the homepage

-H
Span hosts: when recursive, it follows the links to foreign hosts too

-l 2 (note: lowercase 'L')
Recursion depth: I've set it to 2, to follow the links in the main page and then the links to the video files in the linked pages

-nd
No directories: doesn't create directories, puts everything in the same dir

-t 1
Retries: I've set it to 1, to avoid wasting time on retries when a server doesn't answer
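
Putting the switches together on a made-up gallery page (the URL and extensions here are just placeholders), with an extra -w 1 to wait a second between requests and go a little easier on the servers:

wget -A jpg,jpeg,png -r -H -l 2 -nd -t 1 -w 1 http://gallery.example.com/daily.html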


Hope you found it interesting
byez,

+mala
#3 - 01-04-2003, 08:58 PM - null (Member)

Hello...

Here is a way one could get the (nick)name of the last person who posted a reply to a favorite thread on this forum:
$ lynx -nolist -dump 'http://ret.mine.nu/board/viewforum.php?f=3' |
> grep -2 "Browsing the web" | tail -1 | awk '{ print $1 }'
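
A hypothetical wrapper around the same pipeline, so the thread title becomes a parameter (the function name and factoring are mine, not part of the original post):

lastposter() {
  lynx -nolist -dump 'http://ret.mine.nu/board/viewforum.php?f=3' |
    grep -2 "$1" | tail -1 | awk '{ print $1 }'
}

# usage: lastposter "Browsing the web"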

- GUI is designed for USERS.
> All the others use cmdline ...

null
#4 - 01-05-2003, 05:23 PM - mala (Administrator)
eheh...

Heh, that's great!

In the meanwhile, I've written a little perl script to ease link parsing. That is, a script which lets you:

- extract all the links from one page
- print only the ones that either
  - match a specified pattern in the URL, or
  - match a specified pattern in the text "tagged" by the anchor

I've tried to cut'n'paste it from the preview and it works. You can use it this way:

perl filename.pl http://url "URL regexp" "text regexp"

For instance,

perl exturl.pl http://ret.mine.nu/links.html "" "descendants"
gives you the link to the Immortal Descendants mirror (this version doesn't convert relative URLs to absolute ones yet, as you might notice)

Instead,
perl exturl.pl http://ret.mine.nu/links.html "cjb"
gives you only links to the websites at cjb.net

perl exturl.pl http://ret.didjitalyphrozen.com/board/sear...rch_author=mala "viewforum"
returns the URLs of all the forums on this website where I've written a message

perl exturl.pl http://ret.didjitalyphrozen.com/board/sear...rch_author=mala "" "stegano|command line"
returns the URLs of forums or messages whose subject contains "stegano" or "command line"

-----------------------------------------------------------------------------

#!/usr/bin/perl

use LWP::UserAgent;
use Data::Dumper; # quick and dirty way to dump data on the screen

# this is a very simplified implementation of getpage, but it should work
# fine if you don't have to authenticate yourself

sub getpage{
my $url = shift;
my $ua = new LWP::UserAgent;

# note: for some websites we _have_ to provide agent name
$ua->agent('Two/0.1');

# connect to the main url
my $req = new HTTP::Request GET=> $url;
my $res = $ua->request($req);

die "Error connecting to the main page:n".Dumper($res->headers) unless $res->is_success;

return $res->content;
}

sub xtracturl{
my ($content,$regexp1,$regexp2) = @_;

my (@links,@links2);
my %hash;

# powerful regexp! Hope that works
while ($content =~ /<\s*a.*?href\s*=[\s]*"?([^\s">]+).*?>(.*?)<\/a>/gsi){
my $url = $1;
my $str = $2;
# note: in perl an empty pattern would silently re-use the last successful
# match, so apply each filter only when a non-empty regexp was given
if (defined $regexp1 && $regexp1 ne '' && $url =~ /$regexp1/i){
push (@links, $url);
}
if (defined $regexp2 && $regexp2 ne '' && $str =~ /$regexp2/i){
push (@links, $url);
}
}

# clean links array from dupes
for (@links){
$hash{$_}++ || push @links2,$_;
}

return @links2;
}

print join ("\n",xtracturl (getpage ($ARGV[0]),$ARGV[1],$ARGV[2]))."\n";
byez,

+mala
#5 - 01-07-2003, 10:13 PM - null (Member)
nice script ...

Looks like a very nice script, mala, especially the regexp!

I have two things to point out:

1. You can safely change the <\s*a part of the regexp to <a, because no whitespace is allowed between the tag opening sign "<" and the tag name.

2. You know me... I couldn't (almost) post anything here without pasting a nice example from my console:

lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http'

Note that by default lynx displays all the links found inside a web page when you use the -dump switch. These links are displayed at the end of the output. If you don't want lynx to display these references, you have to use the -nolist option too.
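
The same extraction can be done with GNU grep alone, assuming a grep recent enough to support -o (print only the part of the line that matches):

lynx -dump http://ret.mine.nu/links.html | grep -o 'http[^ ]*'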



null
#6 - 01-08-2003, 01:42 PM - mala (Administrator)
Re: nice script ...

Quote:
1. You can safely change the <\s*a part of the regexp to <a, because no whitespace is allowed between the tag opening sign "<" and the tag name.
Good! I wasn't sure about that, so I added those chars without even reading any html specs

Quote:
2. You know me... I couldn't (almost) post anything here without pasting a nice example from my console:
I know you, and I couldn't wait to see it

As usual it's very nice! And please, let me comment on it, to:

1) see if I understood it well
2) explain it to others, so we'll be able to build a little "regexp tute" step by step

Quote:
lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http'
Well, the steps of this oneliner are the following:

1) Connect with lynx to http://ret.mine.nu/links.html and output the dump of the page, that is NOT the source but the rendered text, with a list of links at its end (this last detail is important)
2) The output of lynx is then piped to sed, which takes it as input and processes it with a substitution of "something" (we'll see what later)
3) Finally, the output of sed is piped to grep, which returns only the lines that begin with http (the '^' before http generally means "the string begins here"; in this case we are matching against every line of the page processed by sed)

What does sed do?

's/^ *[0-9]*\. [^h]*//'

Good, we can see an s///, which means substitution: the syntax is

s/string to substitute/new string/

and, since we have s/something//, we can understand that we actually want to DELETE something which satisfies the regexp in the first section.

^ *[0-9]*\. [^h]* means:

^ the line begins here
_* (that is, <space>*) 0 or more spaces (* means 0 or more)
[0-9]* 0 or more digits (well, maybe I'd put a "+" here, since we should always have at least one)
\._ (that is, \.<space>) a dot followed by a space (the dot has to be escaped with '\' because otherwise it has another meaning, that is 'any char')
[^h]* anything which is not an h, 0 or more times (^ at the beginning of the square brackets means 'not')
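
A quick sanity check of that substitution on a fabricated lynx-style reference line (the URL is just an example):

$ echo '   3. http://www.example.com/' | sed 's/^ *[0-9]*\. [^h]*//'
http://www.example.com/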

Really nice. Now I'm going to work on the recursive exturl perl sub (I already have a working version, but not a looping one; I'm writing a looping version which lets users specify the recursion depth. What do I mean by looping? Well, I'll explain it better in my next message)
byez,

+mala
#7 - 01-11-2003, 06:04 PM - null (Member)
thanks ...

Oops!

Thanks, mala, for doing my job. It was actually my responsibility to explain the pipeline, but you did it. Thanks again, it's a really good analysis. Looks like you have something to do with reverse engineering, can that be!?

- Greetz -

null
#8 - 01-11-2003, 10:45 PM - score (Junior Member)

mala: i noticed in one of your posts a - .*? - in a regexp. i'm new to regexps and i understood
- . - means "any char"
- * - means 0 or more times
then, i know - ? - means 0 or 1 'times' (or if you will, 'possibly' this char...)

but i still don't understand what - .*? - does?
(still, i did use it in code - along your lines - and it helped me a lot....)

could you help?


ps. anyway i found the book "mastering regular expressions in perl" very helpful
#9 - 01-12-2003, 09:20 AM - mala (Administrator)

Quote:
mala: i noticed in one of your posts a - .*? - in a regexp. i'm new to regexps and i understood
- . - means "any char"
- * - means 0 or more times
then, i know - ? - means 0 or 1 'times' (or if you will, 'possibly' this char...)

but i still don't understand what - .*? - does?
(still, i did use it in code - along your lines - and it helped me a lot....)

could you help?
Sure!

That question mark, used after the asterisk, makes the match "non-greedy" (some call it lazy). Usually, in a regular expression, when you write .* you mean "match everything", and this is quite a powerful command: by default the match is greedy, so everything up to the LAST occurrence of the text which follows the .* will match.

As an example:

<a href=".*">

will match both <a href="page.htm"> and the whole of <a href="page.htm" target="wherever">, which is not what you want.

If, instead, you write

<a href=".*?">

it will match everything between the double quotes, stopping at the FIRST double quote it finds.
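
A quick demonstration from the shell (the sample tag is made up):

$ echo '<a href="page.htm" target="_blank">' | perl -ne 'print "$1\n" if /href="(.*)"/'
page.htm" target="_blank
$ echo '<a href="page.htm" target="_blank">' | perl -ne 'print "$1\n" if /href="(.*?)"/'
page.htm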

Quote:
ps. anyway i found the book "mastering regular expressions in perl" very helpful
Here's a couple of links for those who would like to take a look at this book:

http://www.trig.org/cs/
http://books.pdox.net/Computers/

(quite easy to find: just put the title of the book into google)
byez,

+mala
#10 - 01-15-2003, 07:15 PM - null (Member)
a small addition ...

I would like to add something about the "traditional" way of getting non-greedy matching (sed, for one, has no .*? operator). I'll explain it with a simple example which kills the tags in an HTML file. You could do something like:

$ cat index.htm | sed 's/<[^>]*>//g'

... and it would remove all the tags from the input file (here: index.htm), under certain circumstances (to be explained by mala!!!). The regexp here is <[^>]*>, which means:

1. match the '<'
2. match everything which is not a '>' zero or more times - [^>]*
3. match the '>'

This way the matching stops at the first occurrence of '>', in contrast to <.*>, which "eats" all the chars between the first '<' and the last '>' on a line.
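
A quick illustration of those "certain circumstances": sed works line by line, so a tag split across two lines survives (the two-line input here is made up):

$ printf '<b>bold</b> plain\n' | sed 's/<[^>]*>//g'
bold plain
$ printf '<b\n>bold</b>\n' | sed 's/<[^>]*>//g'
<b
>bold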

null