sourCEntral - mobile manpages

pdf

Lingua::StopWords

NAME

Lingua::StopWords - Stop words for several languages.

SYNOPSIS

use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('en');
my @words = qw( i am the walrus goo goo g'joob );
# prints "walrus goo goo g'joob"
print join ' ', grep { !$stopwords->{$_} } @words;

DESCRIPTION

In keyword search, it is common practice to suppress a collection of "stopwords": words such as "the", "and", "maybe", etc. which exist in in a large number of documents and do not tell you anything important about any document which contains them. This module provides such "stoplists" in several languages.

Supported Languages
|-----------------------------------------------------------|
| Language | ISO code | default encoding | also available |
|-----------------------------------------------------------|
| Danish | da | ISO-8859-1 | UTF-8 |
| Dutch | nl | ISO-8859-1 | UTF-8 |
| English | en | ISO-8859-1 | UTF-8 |
| Finnish | fi | ISO-8859-1 | UTF-8 |
| French | fr | ISO-8859-1 | UTF-8 |
| German | de | ISO-8859-1 | UTF-8 |
| Hungarian | hu | ISO-8859-2 | UTF-8 |
| Indonesian | id | ISO-8859-1 | UTF-8 |
| Italian | it | ISO-8859-1 | UTF-8 |
| Norwegian | no | ISO-8859-1 | UTF-8 |
| Portuguese | pt | ISO-8859-1 | UTF-8 |
| Romanian | ro | ISO-8859-2 | UTF-8 |
| Spanish | es | ISO-8859-1 | UTF-8 |
| Swedish | sv | ISO-8859-1 | UTF-8 |
| Russian | ru | KOI8-R | UTF-8 |
|-----------------------------------------------------------|

FUNCTIONS

getStopWords
my $stoplist = getStopWords('en');
my $utf8_stoplist = getStopWords('en', 'UTF-8');

Retrieve a stoplist in the form of a hashref where the keys are all stopwords and the values are all 1.

$stoplist = {
and => 1,
if => 1,
# ...
};

getStopWords() expects 1-2 arguments. The first, which is required, is an ISO code representing a supported language. If the ISO code cannot be found, getStopWords returns undef.

The second argument should be ’ UTF-8 ’ if you want the stopwords encoded in UTF-8. The UTF-8 flag will be turned on, so make sure you understand all the implications of that.

INSTALLATION

To install this module type the following:

perl Build.PL
./Build
./Build test
./Build install

SEE ALSO

The stoplists supplied by this module were created as part of the Snowball project (see <http://snowball.tartarus.org>, Lingua::Stem::Snowball).

Lingua::EN::StopWords provides a different stoplist for English.

SOURCE REPOSITORY

<https://github.com/wollmers/Lingua-StopWords>

AUTHOR

Maintained by Helmut Wollmersdorfer <helmut AT wollmersdorfer DOT at> and Marvin Humphrey <marvin at rectangular dot com>. Original author Fabien Potencier, <fabpot at cpan dot org>.

COPYRIGHT

Copyright 2021 Helmut Wollmersdorfer Copyright 2004-2008 Fabien Potencier, Marvin Humphrey

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.3 or, at your option, any later version of Perl 5 you may have available.

pdf