Lab 1: A Simple Inverted Index

Requirements

  1. Some words occur so frequently that they become noise in the inverted index: they obscure the features of a document's more representative words. In the first part of this lab, run a word count over the complete works of Shakespeare to identify such words.
  2. Design a MapReduce algorithm that computes an inverted index over the crawled web data; the final inverted index must not contain the words identified in step 1.
  3. Complete at least two extensions:
    • Data scrubbing: case-insensitivity, punctuation-insensitivity, etc.
    • Build a query program on top of the inverted index file that accepts user-specified words or phrases and returns the IDs of the documents containing them.
    • Implement a full inverted index that records, for each word, the document ID and the word's position within the document.

Implementation

Part 1: Identifying High-Frequency Noise Words with WordCount

Hadoop's stock WordCount example (adapted to strip punctuation and ignore case) is sufficient here. It produced 24,418 distinct words; the word-frequency distribution was then computed with a Python script:

Figure 1: Word Frequency Distribution

Figure 1 shows the distribution of the 2,000 most frequent words. Based on it, words with a frequency greater than 300 were classified as noise words and extracted with a Python script.
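The extraction step above can be sketched as follows. This is an illustration rather than the lab's actual Python script, and the class name, helper, and example counts are made up; only the "word<TAB>count" output format of WordCount and the frequency-300 cutoff come from the text:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NoiseWordFilter {
    // Each line of WordCount output is "word<TAB>count"; words whose count
    // exceeds the threshold are treated as noise (stop) words.
    public static Map<String, Integer> noiseWords(String[] wordCountLines, int threshold) {
        Map<String, Integer> noisy = new LinkedHashMap<>();
        for (String line : wordCountLines) {
            String[] parts = line.split("\t");
            int count = Integer.parseInt(parts[1]);
            if (count > threshold) {
                noisy.put(parts[0], count);
            }
        }
        return noisy;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not the real Shakespeare statistics.
        String[] lines = {"the\t27843", "aaron\t98", "and\t26847", "abate\t17"};
        System.out.println(NoiseWordFilter.noiseWords(lines, 300).keySet());
        // prints: [the, and]
    }
}
```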

Part 2: Inverted Index

Upload the high-frequency words obtained in Part 1 to an HDFS directory and load them into the MapReduce job as a cache file:

String filename = "/user/alexzhangch/inverted_index/stop.txt";
job.addCacheFile(new Path(filename).toUri());

Then, in the map phase, these high-frequency words are loaded into a HashMap; while building the inverted index, each word is looked up in the map, and a result is recorded only for words that are not high-frequency.
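The stop-word loading step can be sketched as below. The class and method names are illustrative, not the lab's exact code; in the real Mapper, the reader would wrap the cached HDFS file (e.g. stop.txt opened in setup(Context)) rather than an in-memory string:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;

public class StopWordLoader {
    // Read one stop word per line into a HashMap for O(1) membership tests,
    // matching the stopWord.get(word_) == null check used in map().
    public static HashMap<String, Boolean> load(BufferedReader reader) throws IOException {
        HashMap<String, Boolean> stopWord = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String w = line.trim().toLowerCase();
            if (!w.isEmpty()) {
                stopWord.put(w, Boolean.TRUE);
            }
        }
        return stopWord;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new StringReader("the\nand\nof\n"));
        HashMap<String, Boolean> stop = load(r);
        System.out.println(stop.containsKey("the"));   // prints: true
        System.out.println(stop.containsKey("aaron")); // prints: false
    }
}
```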

Part 3: Extensions

Extension 1: Data Scrubbing

A naive parser will group words by attributes which are not relevant to their meaning. Modify your parser to “scrub” words. You can define “scrub” however you wish; some suggestions include case-insensitivity, punctuation-insensitivity, etc. You will get extra karma for creating a language-independent scrubbing algorithm, but this is not required.

Implementation:

Punctuation and special characters are replaced via the String class's replaceAll method, case is ignored, and only the bare words are kept when building the index:

// Split each line of text into words, using the following punctuation characters as delimiters
String cleanLine = value.toString().toLowerCase().replaceAll("[_|&$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
StringTokenizer itr = new StringTokenizer(cleanLine);

Extension 2: Full Inverted Index

Instead of creating an inverted file index (which maps words to their document ID), create a full inverted index (which maps words to their document ID + position in the document). How do you specify a word’s position in the document?

Implementation:

The job uses TextInputFormat, so each record is one line of an input file: the key is the line's byte offset within the file, and the value is the line's text. According to the Hadoop documentation, one Map instance processes one file, splitting it into lines and calling the map function on them in order. The Mapper class can therefore keep an int counter of the current line number, and, while processing each line, record every word's position within that line. The value emitted by the map phase thus has the format:

filename-linenumber-offset

The concrete implementation:

public void map(LongWritable position, Text value, Context context
                ) throws IOException, InterruptedException {
    line_number += 1;
    Integer line_off = line_number;

    String file_name = ((FileSplit) context.getInputSplit()).getPath().getName();
    // Split each line of text into words, using the following punctuation characters as delimiters
    String cleanLine = value.toString().toLowerCase().replaceAll("[_|&$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");

    StringTokenizer itr = new StringTokenizer(cleanLine);

    Integer count = 0;
    while (itr.hasMoreTokens()) {
        String word_ = itr.nextToken();
        count += 1;
        // Emit only non-stop words
        if (stopWord.get(word_) == null) {
            word.set(word_);
            file.set(file_name + "-" + line_off.toString() + "-" + count.toString());
            context.write(word, file);
        }
    }
}
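The reduce side, not shown in the listing above, only needs to concatenate all postings for each word into one comma-separated list. The sketch below isolates that core logic as plain Java (the class and method names are illustrative, not the lab's actual Reducer); in the real Hadoop Reducer the same loop would iterate over Iterable<Text> values and call context.write:

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class PostingJoiner {
    // For one word, concatenate all "filename-linenumber-offset" postings
    // into a single comma-separated posting list, matching the output
    // structure shown in the final inverted index.
    public static String join(List<String> postings) {
        StringJoiner sj = new StringJoiner(", ");
        for (String p : postings) {
            sj.add(p);
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        List<String> postings = Arrays.asList(
                "kinghenryv-1622-1", "kinghenryv-1621-1", "hamlet-4823-9");
        System.out.println("abate\t" + join(postings));
        // prints: abate	kinghenryv-1622-1, kinghenryv-1621-1, hamlet-4823-9
    }
}
```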

The final inverted index has the following structure:

aaron   titusandronicus-2984-5, titusandronicus-2379-1, titusandronicus-919-1, titusandronicus-2531-1, titusandronicus-3069-1, titusandronicus-2394-1, titusandronicus-812-5, titusandronicus-1800-1, titusandronicus-2534-7, titusandronicus-3477-7, titusandronicus-924-2, titusandronicus-926-1, titusandronicus-912-1, titusandronicus-3492-1, titusandronicus-3053-1, titusandronicus-936-1, titusandronicus-2523-1, titusandronicus-2407-7, titusandronicus-2409-1, titusandronicus-2410-2, titusandronicus-808-2, titusandronicus-2410-7, titusandronicus-3499-4, titusandronicus-2412-4, titusandronicus-3073-1, titusandronicus-2519-4, titusandronicus-1113-1, titusandronicus-2540-1, titusandronicus-3048-1, titusandronicus-2415-1, titusandronicus-838-1, titusandronicus-48-1, titusandronicus-2422-1, titusandronicus-1440-1, titusandronicus-2548-6, titusandronicus-2517-2, titusandronicus-3116-1, titusandronicus-1784-2, titusandronicus-2426-1, titusandronicus-1089-1, titusandronicus-2358-1, titusandronicus-900-1, titusandronicus-2430-1, titusandronicus-3037-1, titusandronicus-2504-1, titusandronicus-1780-8, titusandronicus-1296-6, titusandronicus-3112-1, titusandronicus-2550-1, titusandronicus-3141-1, titusandronicus-3027-1, titusandronicus-797-1, titusandronicus-795-2, titusandronicus-183-4, titusandronicus-3025-5, titusandronicus-3732-5, titusandronicus-1351-2, titusandronicus-897-2, titusandronicus-3018-1, titusandronicus-854-1, titusandronicus-2438-1, titusandronicus-2387-1, titusandronicus-1348-1, titusandronicus-1853-1, titusandronicus-1772-1, titusandronicus-1770-2, titusandronicus-2490-1, titusandronicus-3081-1, titusandronicus-3744-1, titusandronicus-1334-1, titusandronicus-2833-6, titusandronicus-1824-5, titusandronicus-2312-5, titusandronicus-1389-4, titusandronicus-873-1, titusandronicus-2443-1, titusandronicus-1827-1, titusandronicus-2584-1, titusandronicus-2447-1, titusandronicus-1845-1, titusandronicus-1310-3, titusandronicus-1074-5, titusandronicus-1312-1, 
titusandronicus-657-6, titusandronicus-3762-5, titusandronicus-494-3, titusandronicus-2320-1, titusandronicus-2455-1, titusandronicus-1068-4, titusandronicus-891-1, titusandronicus-3086-1, titusandronicus-2457-2, titusandronicus-2459-1, titusandronicus-1837-2, titusandronicus-1053-1, titusandronicus-2465-1, titusandronicus-1051-2, titusandronicus-2575-2
abaissiez kinghenryv-4547-6
abandon 3kinghenryvi-415-5, tamingoftheshrew-453-5, troilusandcressida-2712-3, twelfthnight-482-5, othello-2857-4, timonofathens-3580-8, asyoulikeit-4082-8, asyoulikeit-3493-3, asyoulikeit-3490-2, asyoulikeit-1001-3
abash troilusandcressida-791-5
abate kinghenryv-1622-1, kinghenryv-1621-1, kinghenryv-1621-4, kinghenryv-828-10, kinghenryv-3413-7, hamlet-4823-9, titusandronicus-137-6, loveslabourslost-3738-1, tamingoftheshrew-278-3, venusandadonis-792-6, kingrichardiii-5768-1, merchantofvenice-3738-3, glossary-151-2, glossary-3-1, midsummersnightsdream-2052-1, cymbeline-654-1, romeoandjuliet-3603-1
abated 2kinghenryiv-358-5, kinglear-2302-3, coriolanus-3661-1

Summary and Takeaways

  1. How long do you think this project would take you to finish?

One day.

  2. How much time did you actually spend on this project?

Two days.

  3. Acknowledge any assistance you received from anyone except assigned course readings and the course staff.

Details such as how Hadoop splits its input data were found via Google; the approach of using a global variable to record the line number in order to obtain a word's position in the document was suggested by Li Yongbin (计41).

  4. What test corpus did you load into HDFS? Where is it located (i.e. HDFS path)?

The complete works of Shakespeare provided by the course, located at /user/alexzhangch/inverted_index/input.

  5. What, if any, “noisy words” did you find as a result of your WordCount? By what criteria did you determine that they were “noisy”? Were you surprised by any of these words?

Noise words were found by computing word frequencies with WordCount and plotting the distribution with Python; after inspecting the corresponding words, the cutoff for noise words was set at a frequency above 300.

  6. Which extensions did you do? Please also detail any interesting results you found.

    • Extensions implemented
      • Data scrubbing: stripping punctuation and special characters, ignoring case, etc.
      • Full inverted index, using the line number plus the word's position within that line as its position
    • Interesting results
      • In the map phase, each input file is processed by a single Map instance
      • With TextInputFormat, the key is the line's byte offset, not its line number; but since a Map instance reads the file line by line and calls the map function on each line in order, a member variable can track the line number currently being processed, which is enough to locate a word's position in the document
  7. Takeaways

Became familiar with Hadoop MapReduce programming, gained a deeper understanding of how it is implemented, and learned more about the data flow within a MapReduce job, especially how input data is split.
