Customizing the IK Analyzer

Extending the dictionary

Static update

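  1. Add a new dictionary file (one word per line)
   cat > config/analysis-ik/new_word.dic << EOF
   海底捞
   西红柿炒鸡蛋
   青岛啤酒
   EOF
  2. Edit the configuration file to register the new dictionary
   vi config/analysis-ik/IKAnalyzer.cfg.xml
   # <entry key="ext_dict">new_word.dic</entry>
  3. Restart the service
  4. Rebuild the index (re-index the documents that were split into single characters before the new word existed)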
   POST /shop/_update_by_query
   {
       "query": {
           "bool": {
               "must": [
                   {"term": {"name": "海"}},
                   {"term": {"name": "底"}},
                   {"term": {"name": "捞"}}
               ]
           }
       }
   }
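
The _update_by_query request above is written by hand for one word. As a minimal sketch, the same body can be generated for any new word by emitting one term clause per character. This assumes the word was previously indexed as single-character tokens, which is not true of every word (a word such as 青岛啤酒 may already have been split into longer tokens). The helper name build_update_by_query is made up for this illustration and is not part of any Elasticsearch client.

```python
import json


def build_update_by_query(word, field="name"):
    """Build an _update_by_query body that matches documents containing
    every character of `word` as a separate term in `field`.

    Assumption: before the dictionary change, the analyzer split `word`
    into single characters, so each character is an indexed term.
    """
    must = [{"term": {field: ch}} for ch in word]
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    # Print the body so it can be pasted after: POST /shop/_update_by_query
    print(json.dumps(build_update_by_query("海底捞"), ensure_ascii=False, indent=4))
```

The printed JSON is meant to be sent as the body of POST /shop/_update_by_query, once per newly added word.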

Dynamic update

vi config/analysis-ik/IKAnalyzer.cfg.xml
# <entry key="remote_ext_dict">http://yoursite.com/getCustomDict</entry>

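The HTTP endpoint must return two response headers, Last-Modified and ETag, and its body should list one word per line in UTF-8. The plugin polls the URL once a minute; as soon as either header value changes, it fetches the word list again and reloads its dictionary without a restart. Documents that are already indexed are not re-analyzed, so they still need the index rebuild described under static update.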
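
To make the header requirement concrete, here is a minimal sketch of such an endpoint using only the Python standard library. Everything in it is an assumption for illustration: the in-memory WORDS list, port 8000, and deriving the ETag from a hash of the content. It answers both HEAD and GET on any path, on the assumption that the plugin checks the headers before downloading the body.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory word list; a real service would read a file or database.
WORDS = ["海底捞", "西红柿炒鸡蛋", "青岛啤酒"]
# Set once at startup. The ETag below follows the content, and a change in
# either header is enough to trigger a reload.
LAST_MODIFIED = formatdate(usegmt=True)


def dict_body(words):
    """Serialize the dictionary: one word per line, UTF-8 encoded."""
    return ("\n".join(words) + "\n").encode("utf-8")


def etag_for(body):
    """Derive the ETag from the content so it changes whenever the words change."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


class DictHandler(BaseHTTPRequestHandler):
    def _send_headers(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("ETag", etag_for(body))
        self.end_headers()

    def do_HEAD(self):
        # Headers only: lets a client detect changes cheaply.
        self._send_headers(dict_body(WORDS))

    def do_GET(self):
        body = dict_body(WORDS)
        self._send_headers(body)
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), DictHandler).serve_forever()
```

Hashing the content is one simple way to guarantee that the ETag changes exactly when the word list does; a version counter or the file's modification time would serve the same purpose.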

Synonyms

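  1. Add a synonym file (each line is a comma-separated group of equivalent terms)
   cat > config/analysis-ik/new_synonym.dic << EOF
   西红柿炒鸡蛋,番茄炒鸡蛋,番茄炒蛋
   炖菜,烩菜
   EOF
  2. Recreate the index with a synonym filter and analyzers that use it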
   PUT /shop?include_type_name=false
   {
       "settings": {
           "number_of_shards": 1,
           "number_of_replicas": 1,
           "analysis": {
               "filter": {
                   "my_synonym_filter": {
                       "type": "synonym",
                       "synonyms_path": "analysis-ik/new_synonym.dic"
                   }
               },
               "analyzer": {
                   "ik_syno": {
                       "type": "custom",
                       "tokenizer": "ik_smart",
                       "filter": ["my_synonym_filter"]
                   },
                   "ik_syno_max": {
                       "type": "custom",
                       "tokenizer": "ik_max_word",
                       "filter": ["my_synonym_filter"]
                   }
               }
           }
       },
       "mappings": {
           "properties": {
               "id": {"type": "integer"},
               "name": {
                   "type": "text",
                   "analyzer": "ik_syno_max",
                   "search_analyzer": "ik_syno"
               },
               "tags": {
                   "type": "text",
                   "analyzer": "whitespace",
                   "fielddata": true
               },
               "location": {"type": "geo_point"},
               "remark_score": {"type": "double"},
               "price_per_man": {"type": "integer"},
               "category_id": {"type": "integer"},
               "category_name": {"type": "keyword"},
               "seller_id": {"type": "integer"},
               "seller_remark_score": {"type": "double"},
               "seller_disabled_flag": {"type": "integer"}
           }
       }
   }
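
To show what an equivalent-synonym line means, here is a simplified Python sketch that expands a term into its whole group. It only illustrates the file format and is not how Elasticsearch implements the filter: the real synonym filter works on the tokens produced by the tokenizer, and the names parse_synonyms and expand are invented for this example.

```python
def parse_synonyms(lines):
    """Map every term to the full group of terms listed on its line."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        group = [t.strip() for t in line.split(",") if t.strip()]
        for term in group:
            table[term] = group
    return table


# Same content as new_synonym.dic above.
SYNONYMS = parse_synonyms([
    "西红柿炒鸡蛋,番茄炒鸡蛋,番茄炒蛋",
    "炖菜,烩菜",
])


def expand(term):
    """Return the term's synonym group, or the term alone if it has none."""
    return SYNONYMS.get(term, [term])


if __name__ == "__main__":
    print(expand("烩菜"))
    print(expand("火锅"))
```

With this mapping, a search for 烩菜 is treated as a search for either 炖菜 or 烩菜, which is the effect the ik_syno search analyzer is configured to produce.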

Reworking relevance scoring

GET /shop/_search
{
    "_source": "*",
    "script_fields": {
        "distance": {
            "script": {
                "source": "haversin(lat, lon, doc['location'].lat, doc['location'].lon)",
                "lang": "expression",
                "params": {"lat": 39.90, "lon": 116.38}
            }
        }
    },
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "must": [
                        {
                            "bool": {
                                "should": [
                                    {"match": {"name": {"query": "炖菜", "boost": 0.1}}}, // boost: weight applied to this clause's score
                                    {"term": {"category_id": {"value": 1, "boost": 0.1}}} // affects recall
                                ]
                            }
                        },{
                            "term": {"seller_disabled_flag": 0}
                        }
                    ]
                }
            },
            "functions": [
                {
                    "gauss": { // decay functions: gauss (Gaussian), exp (exponential), linear
                        "location": {
                            "origin": "39.90,116.38",
                            "offset": "7km",
                            "scale": "13km",
                            "decay": 0.4
                        }
                    },
                    "weight": 2.5
                },{
                    "field_value_factor": {"field": "remark_score"},
                    "weight": 0.2
                },{
                    "field_value_factor": {"field": "seller_remark_score"},
                    "weight": 0.1
                },{
                    "filter": {"term": {"category_id": 1}}, // affects ranking
                    "weight": 0.2
                }
            ],
            "score_mode": "sum",
            "boost_mode": "sum"
        }
    },
    "sort": [
        {"_score": {"order": "desc"}}
    ],
    "aggs": {
        "group_by_tags": {
            "terms": {"field": "tags"}
        }
    }
}
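
To see how the numbers in this query combine, the following Python sketch recomputes the final score by hand. It assumes the documented Gaussian decay formula (the score is 1.0 within offset and falls to decay at offset + scale), field_value_factor with its default factor and no modifier, and score_mode and boost_mode both set to sum as above. The function names and the sample document values are hypothetical.

```python
import math


def gauss_decay(distance_km, offset_km=7.0, scale_km=13.0, decay=0.4):
    """Gaussian decay: 1.0 within `offset`, and `decay` at offset + scale."""
    d = max(0.0, distance_km - offset_km)
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    return math.exp(-d ** 2 / (2.0 * sigma_sq))


def final_score(query_score, distance_km, remark_score,
                seller_remark_score, in_category):
    """Combine the four weighted functions (score_mode: sum) and add the
    result to the query score (boost_mode: sum)."""
    functions = [
        2.5 * gauss_decay(distance_km),   # distance from the origin
        0.2 * remark_score,               # shop rating
        0.1 * seller_remark_score,        # seller rating
    ]
    if in_category:
        functions.append(0.2)             # filter matched: weight only
    return query_score + sum(functions)


if __name__ == "__main__":
    # Hypothetical shop: 20 km away, both ratings 4.0, in category 1.
    print(round(gauss_decay(20.0), 3))
    print(round(final_score(1.0, 20.0, 4.0, 4.0, True), 3))
```

At 20 km the shop sits exactly at offset + scale, so the decay function returns the configured decay of 0.4 and the distance term contributes 2.5 × 0.4 = 1.0 to the score.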

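Recall and precision

Recall: if 100 documents satisfy the condition but the query returns only 40 of them, recall is only 40%.

Precision: if the query returns 40 documents and only 20 of them are correct, precision is only 50%.

Note: the two pull against each other. Dropping the recall strategy (the category_id term in the should clause) raises precision but lowers recall; dropping the ranking strategy (the category_id filter function in function_score) raises recall but lowers precision. A practical approach is to query with the ranking strategy first and, if the results are unsatisfactory, query again with the recall strategy added.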
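
The two definitions can be checked with a few lines of Python. This is a minimal sketch over sets of document ids; the ids are made up so that the results reproduce the 40% and 50% figures above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids
    against the set of ids that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = range(100)  # 100 documents satisfy the condition

    # The query returns 40 of the relevant documents: recall is 40%.
    print(precision_recall(range(40), relevant))

    # The query returns 40 documents, only 20 of them relevant: precision is 50%.
    print(precision_recall(range(80, 120), relevant))
```

In the first case every returned document is relevant, so precision is perfect while recall is low; in the second, half of the returned documents are noise.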
