[高级数据结构] Week 3 Note

Inverted File Index

【倒排文件索引】

Term-Document Incidence Matrix

【术语文档关联矩阵】

用高维向量进行表征，用以存储文本分析。

只需要检索关键字，只需要二位与操作就可以得出silver&truck。 1010&1010 = 1010 第一篇、第三篇文章有 silver truck。

增加词频作为计数参考。
删去常用词存储，关注罕见词汇。
TF/DF 文档中的频率/总文档的频率（的对数），体现出文档的关联性。

Compact Version – Inverted File Index

【倒排索引】

索引是一种在文本中定位给定术语的机制。倒排文件索引包含指向该术语在文本中所有出现位置的指针列表（例如，页数）。

不按文章存储关键字，而按关键子存储文章内关键字所在的指针。

Term Dictionary 词汇表

Postinng List 文本及在文本出现的位置

伪代码：

while ( read a document D ) {
    while ( read a term T in D ) {
        if ( Find( Dictionary, T ) == false )
            Insert( Dictionary, T );
        Get T’s posting list;
        Insert a node to T’s posting list;
    }
}
Write the inverted index to disk;

Token Analyzer 分词
Stop Filter 停用词过滤
Vocabulary Scanner
Vocabulary Insertor
Write the inverted index to disk – Memory management

存词

Word Streamming
- 英语一个词语有不同的形式，如何存储？
- 将多个形式映射到同一个词
Stop Words
- 不代表明确的意思
- 没有文本区分能力
- eg: it、the、a

搜索 Search trees & Hashing

第一，散列表中的数据是无序存储的，如果要输出有序的数据，需要先进行排序。而对于二叉查找树来说，我们只需要中序遍历，就可以在 O(n) 的时间复杂度内，输出有序的数据序列。
第二，散列表扩容耗时很多，而且当遇到散列冲突时，性能不稳定，尽管二叉查找树的性能不稳定，但是在工程中，我们最常用的平衡二叉查找树的性能非常稳定，时间复杂度稳定在 O(logn)。
第三，笼统地来说，尽管散列表的查找等操作的时间复杂度是常量级的，但因为哈希冲突的存在，这个常量不一定比 logn 小，所以实际的查找速度可能不一定比 O(logn) 快。加上哈希函数的耗时，也不一定就比平衡二叉查找树的效率高。
第四，散列表的构造比二叉查找树要复杂，需要考虑的东西很多。比如散列函数的设计、冲突解决办法、扩容、缩容等。平衡二叉查找树只需要考虑平衡性这一个问题，而且这个问题的解决方案比较成熟、固定。
最后，为了避免过多的散列冲突，散列表装载因子不能太大，特别是基于开放寻址法解决冲突的散列表，不然会浪费一定的存储空间。

内存管理

BlockCnt = 0; 
while ( read a document D ) {
  while ( read a term T in D ) {
    if ( out of memory ) {
      Write BlockIndex[BlockCnt] to disk;
      BlockCnt ++;
      FreeMemory;
    }
    if ( Find( Dictionary, T ) == false )
      Insert( Dictionary, T );
    Get T’s posting list;
    Insert a node to T’s posting list;
  }
}

先排好序，再插入进去。

Q:操作系统就会有内存管理，为什么要在倒排索引中再做一次？

A:操作系统只能看到大致信息，但是应用可以根据信息相关语义、词汇频率来优化。

分配存储

Solution 1 : 按词字
Solution 2 : 按文章

动态索引

文档增加
- auxiliary index 辅助索引-> 归并索引
文档删除
- Lazy deletion

压缩 Compression

对 Posting List 压缩。

存储差分文本号，存相对值而不是绝对值。

阈值化 Thresholding

按查询项的频率升序排序；仅按原始查询项的某个百分比进行搜索。

优先搜索过滤强的关键词。

Data Retrieval Performance Evaluation

Response time
Index space

Information Retriveval Performance

> How relevant is the answer set?

Relevance 相关性

Retriveval 检索的

Precision 精确度

Recall 召回率

[高级数据结构] Week 3 Note

于2020年3月15日2020年3月15日由咪啪发布

Inverted File Index

Term-Document Incidence Matrix

Compact Version – Inverted File Index

存词

搜索 Search trees & Hashing

内存管理

分配存储

动态索引

压缩 Compression

阈值化 Thresholding

Data Retrieval Performance Evaluation

Information Retriveval Performance

0 条评论

发表评论取消回复

Like A programmer

[ BASH ] 帮同事姐姐批量修改文件名后缀

Like A programmer

Sublime 全局跳转失灵之后

Like A programmer

[Web] 试试Parcel

[高级数据结构] Week 3 Note

于2020年3月15日2020年3月15日由咪啪发布

Inverted File Index

Term-Document Incidence Matrix

Compact Version – Inverted File Index

存词

搜索 Search trees & Hashing

内存管理

分配存储

动态索引

压缩 Compression

阈值化 Thresholding

Data Retrieval Performance Evaluation

Information Retriveval Performance

0 条评论

发表评论 取消回复

相关文章

Like A programmer

[ BASH ] 帮同事姐姐批量修改文件名后缀

Like A programmer

Sublime 全局跳转失灵之后

Like A programmer

[Web] 试试Parcel

发表评论取消回复