当前位置：首页 > C++ > 正文

C++信息检索算法入门（从零构建倒排索引搜索引擎）

主机测评网
C++
2025-12-24
786

在当今大数据时代，C++信息检索算法是构建高效搜索引擎和文本分析系统的核心技术之一。本教程将带你从零开始，用 C++ 实现一个基础但功能完整的倒排索引（Inverted Index）系统——这是现代搜索引擎如 Google、百度等背后的关键数据结构。

C++信息检索算法入门（从零构建倒排索引搜索引擎） C++信息检索算法倒排索引实现 C++搜索引擎开发文本检索系统第1张

什么是信息检索？

信息检索（Information Retrieval, IR）是指从大量非结构化或半结构化文本中查找与用户查询相关的信息的过程。最常见的应用就是搜索引擎：你输入“C++教程”，它返回相关网页。

而要高效完成这一任务，就需要一种能快速定位关键词出现在哪些文档中的数据结构——这就是倒排索引。

倒排索引原理

传统索引是“文档 → 关键词”（正向索引），而倒排索引是“关键词 → 文档列表”。例如：

apple → [doc1, doc3]
banana → [doc2, doc3]
cherry → [doc1]

这样，当用户搜索“apple banana”时，系统只需查找两个词对应的文档列表并取交集，即可快速得到结果。

用 C++ 实现倒排索引

下面我们将用 C++ 编写一个简单的倒排索引系统。我们将使用标准库中的 std::unordered_map 和 std::set 来存储索引。

步骤 1：定义数据结构

#include <iostream>#include <unordered_map>#include <set>#include <vector>#include <string>#include <sstream>// 倒排索引：关键词 -> 包含该词的文档ID集合typedef std::unordered_map<std::string, std::set<int>> InvertedIndex;

步骤 2：添加文档到索引

我们需要一个函数，将一段文本按空格分词，并将每个词关联到当前文档 ID。

void addDocument(InvertedIndex& index, int docId, const std::string& text) {    std::istringstream iss(text);    std::string word;    while (iss >> word) {        // 简单转为小写（实际项目中需更完善的分词和归一化）        for (auto& c : word) c = std::tolower(c);        index[word].insert(docId);    }}

步骤 3：执行查询

查询多个关键词时，我们取它们文档集合的交集（AND 查询）。

std::set<int> search(const InvertedIndex& index, const std::vector<std::string>& words) {    if (words.empty()) return {};    std::set<int> result = index.at(words[0]); // 初始化为第一个词的结果    for (size_t i = 1; i < words.size(); ++i) {        std::set<int> current = index.count(words[i]) ? index.at(words[i]) : std::set<int>{};        std::set<int> intersection;        std::set_intersection(result.begin(), result.end(),                              current.begin(), current.end(),                              std::inserter(intersection, intersection.begin()));        result = intersection;        if (result.empty()) break; // 提前终止    }    return result;}

步骤 4：完整示例程序

int main() {    InvertedIndex index;    // 添加三个文档    addDocument(index, 1, "apple banana cherry");    addDocument(index, 2, "banana date");    addDocument(index, 3, "apple banana elderberry");    // 查询 "apple banana"    std::vector<std::string> query = {"apple", "banana"};    auto results = search(index, query);    std::cout << "匹配文档ID: ";    for (int id : results) {        std::cout << id << " ";    }    std::cout << std::endl; // 输出: 1 3    return 0;}