Fantastic Unix Forums  

Windows freeware unique sort technique for large text files (hosts)

  #11 (permalink)  
Old 08-02-2008
B. R. 'BeAr' Ederson
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Sat, 2 Aug 2008 13:25:30 -0700, Donita Luddington wrote:

>> There must be a way to uniquify a file from within vi freeware on windows.

>
> I found these pointers for removing duplicate lines in vi
> http://rayninfo.co.uk/vimtips.html
>:%s/^\(.*\)\n\1$/\1/ : delete duplicate lines
>
> http://www.vim.org/tips/tip.php?tip_id=305
>:%s/^\(.*\)\n\1/\1$/ delete duplicate lines
>
> But, executed in vim 7.1 on Windows, this syntax returns an error.


Try this:

:%s/^\([^\n]*\)\n\1$/\1/

Please note that you have to sort the file *beforehand*! The above
will only remove *consecutive* duplicate lines. And if you don't
have any consecutive duplicate lines, you *will* get an "error"
(Pattern not found).
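
The same sort-first requirement applies to the uniq filter, which also collapses only adjacent repeats. A minimal shell sketch, assuming a POSIX-style shell with sort and uniq available (for example from a Unix toolset port); the function name and the sample data are made up for illustration:

```shell
# Hypothetical helper: drop duplicate lines from standard input by
# sorting first, so that identical lines become adjacent for uniq.
sorted_unique() {
    sort | uniq
}

# Sample input with a non-adjacent duplicate ("a" is on lines 1 and 3).
sample=$(printf 'a\nb\na\n')

# Without sorting, uniq keeps both "a" lines because they are not adjacent.
printf '%s\n' "$sample" | uniq

# With sorting first, only one "a" survives.
printf '%s\n' "$sample" | sorted_unique
```

The first pipeline prints all three lines unchanged; the second prints each distinct line once.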

BeAr
--
===========================================================================
= What do you mean with: "Perfection is always an illusion"?              =
===============================================================--(Oops!)===
  #12 (permalink)  
Old 08-03-2008
Johnw
Default Re: Windows freeware unique sort technique for large text files (hosts)

Donita Luddington has brought this to us :
> Is there a way, using windows freeware, to sort unique a huge hosts file?


Here is some info that may interest you.

Hosts File
http://home.comcast.net/~SupportCD/XPMyths.html
Myth - "Special AntiSpyware Hosts Files are necessary to prevent
Spyware infections."
Reality - "Using Special AntiSpyware Hosts Files are a waste of time
and leads to a false sense of security. Any Malware/Spyware can easily
modify the Hosts File at will, even if it is set to Read-only. It is
impossible to "lock-down" a Hosts File unless you are running as a
limited user which makes using it in this case irrelevant anyway.
Various Malware/Spyware uses the Hosts File to redirect your Web
Browser to other sites. They can also redirect Windows to use a Hosts
File that has nothing to do with the one you keep updating. The Hosts
file is an archaic part of networking setups that was originally meant
to be used on a LAN and was the legacy way to look up Domain Names on
the ARPANET. It tells a PC the fixed numeric address of the internal
server(s) so the PC doesn't have to go looking for them through all
possible addresses. It can save time when "discovering" a LAN. I don't
consider 1970's ARPANET technology useful against modern
Malware/Spyware. When cleaning Malware/Spyware from a PC, it is much
easier to check a clean Hosts File than one filled with thousands of
lines of addresses. Considering how easily a Hosts File can be
exploited, redirected and potentially block good sites, it is strongly
recommended NOT to waste time using Special Hosts Files. Especially
when proper Malware/Spyware protection can be achieved by simply using
these steps, all without ever using a Hosts File."


  #13 (permalink)  
Old 08-03-2008
krazycarnie
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Aug 2, 2:00 pm, Donita Luddington <donil...@sbcglobal.net> wrote:
> Is there a way, using windows freeware, to sort unique a huge hosts file?
>
> I've concatenated all the freeware windows hosts files I can find into a
> single huge fifty-thousand line C:\Windows\System32\Drivers\Etc\hosts file
> but the resulting hosts file is so huge, replete with duplicates, that it's
> slowing down windows browsing.
>
> I would like to pare the hosts file to remove duplicates. How?
>
> I tried sorting with windows vim 7.1 freeware but I can't get the unique
> sort option to work inside of vim. What am I doing wrong?
>
> Here is a vim 7.1 command that works inside the huge hosts file:
>   :%!sort  (this sorts the huge windows hosts file just fine)
>
> This vim 7.1 sort unique command should work but it does not:
>   :%!sort -u (this is supposed to sort uniquely)
>
> The syntax is:
> <esc>:    (begin a windows vim 7.1 command)
> !sort -u  (run the following command "sort -u" inside of vim freeware)
>
> When I run "<esc>:!sort -u" inside of vim, it pares the hosts file down to
> a single (empty) line.
>
> Is there another free way to sort uniquely a large windows text file?


For the best info on HOSTS files and managing them, I have found this
site: http://www.mvps.org/winhelp2002/hosts.htm to be very useful.
Not only do they publish a very capable HOSTS file, they also list free
and non-free software that will allow you to manage your HOSTS
file. There are also several other tips and tricks that I find
useful.

The Carnie
  #14 (permalink)  
Old 08-04-2008
beerwolf
Default Re: Windows freeware unique sort technique for large text files (hosts)

Donita Luddington wrote:

> Is there a way, using windows freeware, to sort unique a huge hosts file?
>
> I've concatenated all the freeware windows hosts files I can find into a
> single huge fifty-thousand line C:\Windows\System32\Drivers\Etc\hosts file
> but the resulting hosts file is so huge, replete with duplicates, that it's
> slowing down windows browsing.
>
> I would like to pare the hosts file to remove duplicates. How?
>
> I tried sorting with windows vim 7.1 freeware but I can't get the unique
> sort option to work inside of vim. What am I doing wrong?
>
> Here is a vim 7.1 command that works inside the huge hosts file:
> :%!sort (this sorts the huge windows hosts file just fine)
>
> This vim 7.1 sort unique command should work but it does not:
> :%!sort -u (this is supposed to sort uniquely)
>
> The syntax is:
> <esc>: (begin a windows vim 7.1 command)
> !sort -u (run the following command "sort -u" inside of vim freeware)
>
> When I run "<esc>:!sort -u" inside of vim, it pares the hosts file down to
> a single (empty) line.
>
> Is there another free way to sort uniquely a large windows text file?


Unduplicate, downloadable from http://adriancarter.homestead.com/
might be able to do it, depending on how wide the lines in your
file are. You mention a 50,000 line file; to test Unduplicate just now
I created ~65,500 random lines in Excel, with a high degree of
duplication, then copied them to the clipboard. Unduplicate reduced them
to about 9900 unique values in less than 10 seconds. I then created
the same data but in ~130,000 lines of a text file, and it didn't take
much longer.

I'm away from my development setup at present, using an old slow
early XP system with 250 MB of memory. The reason I can't give
exact timings is that Unduplicate gives no signal after it has done
its thing with the clipboard, a weakness I intend to remedy as soon
as I get back home. But it will probably work for you - you just
have to select all in an editor, copy, click on the Unduplicate tray
icon, wait a while, then paste.

--
beerwolf


  #15 (permalink)  
Old 08-04-2008
jpd
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Sat, 2 Aug 2008 11:00:26 -0700,
Donita Luddington <doniludd@sbcglobal.net> wrote:
> Is there a way, using windows freeware, to sort unique a huge hosts file?
>
> I've concatenated all the freeware windows hosts files I can find into a
> single huge fifty-thousand line C:\Windows\System32\Drivers\Etc\hosts file
> but the resulting hosts file is so huge, replete with duplicates, that it's
> slowing down windows browsing.


I suspect that even with removing all the duplicates you'll still end up
with a file that's a tad big for the usual hosts lookup implementation.
Likely, each lookup will end up reading the entire file line-by-line
until the first hit or end-of-file, whichever comes first.

I think you may need to look at a better solution; firefox with adblock
for example. I assume, but have not verified, that adblock's lookup is
faster, mind. I do know that abusing the hosts file for keeping huge
blacklists is more likely to hurt than to help, and not just in slowness.


> I would like to pare the hosts file to remove duplicates. How?


The easy way for someone with unix experience is to run it through sort,
then uniq. Various editors (emacs, vi(m), probably more) can do it too.
Various ways for obtaining a unix toolset have already been mentioned.

There is a freeware windows implementation available of the programming
(scripting, really) language ``awk''[awk]. The installation consists of
fetching a single executable and putting it somewhere convenient, then
run it with the appropriate arguments (program to execute or file where
the program to execute resides, input files, perhaps output redirection).

Implementing sort in it would be a bit involved, but an in-place ``uniq''
that doesn't need sorting turns out to be easy. In a dos-box, run:

awk '!_[$0]++' inputfile > outputfile

On unix shells you may need to escape the !, but I don't think you need
to on a windows command line, though I'm not sure just how it handles
quoting. This is a bit of a hack in that it is nigh-on unreadable for a
beginner, so let me reassure you that it is entirely possible to write
very readable awk programs. It has been deployed with success as a
language for non-programmers, in fact.
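
For readers who find the one-liner opaque, here is a more verbose sketch of the same idea: keep an associative array of the lines already seen and print a line only the first time it appears. It assumes a POSIX-style shell; the function name dedupe_keep_order and the array name seen are illustrative, not part of awk. In cmd.exe single quotes are not quoting characters, so there the awk program would normally be wrapped in double quotes instead.

```shell
# Hypothetical helper: print each distinct line of the given file once,
# in order of first appearance, without sorting.
dedupe_keep_order() {
    awk '
        # seen[] is indexed by the whole line ($0); an entry is created
        # the first time a line is encountered.
        !($0 in seen) {
            seen[$0] = 1   # remember this line
            print          # and pass it through
        }
    ' "$1"
}
```

It would be called as `dedupe_keep_order inputfile > outputfile`, with the output going to a different file than the input.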


[awk] http://plan9.bell-labs.com/cm/cs/awkbook/ which links to
http://plan9.bell-labs.com/cm/cs/who/bwk/awk95.exe

--
j p d (at) d s b (dot) t u d e l f t (dot) n l .
This message was originally posted on Usenet in plain text.
Any other representation, additions, or changes do not have my
consent and may be a violation of international copyright law.
  #16 (permalink)  
Old 08-17-2008
Donita Luddington
Default Re: Windows freeware unique sort technique for large text files (hosts)

Hi Guys,

By way of update, I followed Bear's and others' original advice and was
able to sort the now fifty-thousand line hosts file in about a second or
two on Windows.

What I did was add the native Win32 port of the UnxUtils at
http://unxutils.sourceforge.net to my WinXP laptop.

This created c:\bin and c:\usr and, more specifically
C:\usr\local\wbin\sort.exe

Thanks to you, this more powerful sort, with its "unique" (-u) and "output"
(-o) options, is part of my Windows command-line repertoire.

It wasn't at first obvious (to me), but, Wikipedia helped with syntax:
http://en.wikipedia.org/wiki/Sort_(Unix)

For others, here's the sort command to pare down the hosts file after you've
combined all those hosts files you can find on the Internet:

Start->Run->cmd
type c:\windows\system32\drivers\etc\hosts | c:\usr\local\wbin\sort.exe -u -o c:\windows\system32\drivers\etc\hosts

The only manual change needed was to move this line back to the top:
127.0.0.1 localhost # this needs to be the first line for some reason

Do you know if sort can be told to sort all but the first line?

It would be nice if the sort command could sort from line 2 to the end so
that the extra step of moving the localhost line wasn't needed.
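
As far as I know, sort has no option for skipping leading lines, but the same effect can be had by handling the first line separately. A sketch assuming a POSIX-style shell with head, tail and sort on the PATH (I have not checked that the UnxUtils ports accept exactly these options); the function name is hypothetical:

```shell
# Hypothetical helper: keep line 1 of the file where it is and
# sort/deduplicate everything from line 2 onward.
sort_keep_first_line() {
    head -n 1 "$1"               # first line, untouched
    tail -n +2 "$1" | sort -u    # remaining lines, sorted and unique
}
```

It would be used as `sort_keep_first_line hosts > hosts.new`, followed by renaming hosts.new over hosts; the output should not be redirected onto the file being read.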
  #17 (permalink)  
Old 08-17-2008
Donita Luddington
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Sat, 2 Aug 2008 22:27:18 +0200, B. R. 'BeAr' Ederson wrote:

> (Better set up a dedicated UnxUtils directory with
> entry in the search path, though.)


Thanks Bear!

Your unxutils advice worked beautifully.

Start->Run->cmd
type c:\windows\system32\drivers\etc\hosts | c:\usr\local\wbin\sort.exe -u -o c:\windows\system32\drivers\etc\hosts

The only manual change needed was to move this line back to the top:
127.0.0.1 localhost # this needs to be the first line for some reason

I'm digging for the sort command that only sorts from the second line down
but haven't found it yet.
  #18 (permalink)  
Old 08-17-2008
Anand Hariharan
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Sun, 17 Aug 2008 08:16:14 -0700, Donita Luddington
<doniludd@sbcglobal.net> wrote:

(...)
>
> Start->Run->cmd
> type c:\windows\system32\drivers\etc\hosts | c:\usr\local\wbin\sort.exe
> -u -o c:\windows\system32\drivers\etc\hosts
>


That qualifies for a UUOC (Useless Use Of Cat; well, 'type' in this case).
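
Spelled out, the point is that sort can read the file itself, so the 'type' and the pipe are unnecessary. A sketch assuming a POSIX-conforming sort in a POSIX-style shell; the function name is illustrative:

```shell
# Hypothetical helper: sort a file uniquely in place. The -o option is
# specified to allow the output file to be one of the input files, so
# naming the same file twice is safe here (unlike a plain "> file"
# redirect, which would empty the file before it is read).
sort_unique_in_place() {
    sort -u -o "$1" "$1"
}
```

On the Windows command line the equivalent would presumably be sort.exe -u -o followed by the hosts path given twice, once as output and once as input.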


> The only manual change needed was to move this line back to the top:
> 127.0.0.1 localhost # this needs to be the first line for some reason
>
> I'm digging for the sort command that only sorts from the second line
> down but haven't found it yet.


Am guessing there must be some variant/clone of 'sed' included in
UnxUtils. If not, since you are so keen on calling sort from within vim,
you can simply do -

:2,$! C:\usr\local\wbin\sort -u

- from within a vim session that is editing your hosts file.

- Anand

  #19 (permalink)  
Old 08-17-2008
B. R. 'BeAr' Ederson
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Sun, 17 Aug 2008 08:16:14 -0700, Donita Luddington wrote:

> On Sat, 2 Aug 2008 22:27:18 +0200, B. R. 'BeAr' Ederson wrote:
>
>> (Better set up a dedicated UnxUtils directory with
>> entry in the search path, though.)

>
> Thanks Bear!


You're welcome. :-) Besides, it is BeAr, not Bear. ;-)

> The only manual change needed was to move this line back to the top:
> 127.0.0.1 localhost # this needs to be the first line for some reason
>
> I'm digging for the sort command that only sorts from the second line down
> but haven't found it yet.


The following command line should contain all commands in a one liner:

sed "/127\.0\.0\.1/d" hosts | tr '[A-Z]' '[a-z]' | sort -u | sed "1i127.0.0.1 localhost" > hosts.new

If UnxUtils are not part of the PATH search string, all utilities
need to be called with their fully qualified names. The "hosts" entries
have to be substituted with the full name including directory
components, if the command is not executed from the directory
containing that hosts file. (Which would be easier...)

The output goes to a hosts.new file rather than straight back to hosts,
because redirecting onto the file that the first sed is still reading
can truncate it before it has been read. Rename hosts.new to hosts
afterwards.

There are other ways to do the above. I settled with deleting lines
containing the localhost (127.0.0.1) entries instead of just preserving
the first line, because the merging of several hosts files may result
in more than one localhost line...
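
One of those other ways, as a sketch only, assuming a POSIX-style shell; the function name and the .new temporary name are illustrative. It differs from the one-liner in two deliberate ways. It writes to a temporary file and renames it only on success, since a redirect onto the file being read can truncate it first. And it uses awk instead of sed to drop only lines whose first two fields are 127.0.0.1 and localhost, because blocklist entries often point at 127.0.0.1 as well and a match on the address alone would delete them too.

```shell
# Hypothetical helper: rebuild a hosts file with a single localhost line
# on top, followed by the remaining entries lowercased, sorted and unique.
# Usage: rebuild_hosts hosts
rebuild_hosts() {
    src=$1
    tmp=$src.new
    {
        echo '127.0.0.1 localhost'
        # Lowercase everything, drop every "127.0.0.1 localhost" line
        # (trailing comments included), then sort and deduplicate the rest.
        tr '[:upper:]' '[:lower:]' < "$src" |
            awk '!($1 == "127.0.0.1" && $2 == "localhost")' |
            sort -u
    } > "$tmp" && mv "$tmp" "$src"
}
```

Note that, like the original pipeline, this lowercases comments too and sorts them in with the entries.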

HTH.
BeAr
--
===========================================================================
= What do you mean with: "Perfection is always an illusion"?              =
===============================================================--(Oops!)===
  #20 (permalink)  
Old 08-17-2008
B. R. 'BeAr' Ederson
Default Re: Windows freeware unique sort technique for large text files (hosts)

On Sun, 17 Aug 2008 18:19:56 +0200 (CEST), Anand Hariharan wrote:

>> The only manual change needed was to move this line back to the top:
>> 127.0.0.1 localhost # this needs to be the first line for some reason
>>
>> I'm digging for the sort command that only sorts from the second line
>> down but haven't found it yet.

>
> Am guessing there must be some variant/clone of 'sed' included in
> UnxUtils.


There is. ;-)

BeAr
--
===========================================================================
= What do you mean with: "Perfection is always an illusion"?              =
===============================================================--(Oops!)===