Expanding the dictionary
Static update
- Add a new dictionary file
cat > config/analysis-ik/new_word.dic << EOF
海底捞
西红柿炒鸡蛋
青岛啤酒
EOF
- Edit the configuration file
vi config/analysis-ik/IKAnalyzer.cfg.xml
# <entry key="ext_dict">new_word.dic</entry>
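- Restart the service
- Rebuild the index (re-index the documents whose name was previously split into single characters)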
POST /shop/_update_by_query
{
"query": {
"bool": {
"must": [
{"term": {"name": "海"}},
{"term": {"name": "底"}},
{"term": {"name": "捞"}}
]
}
}
}
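The request above lists one term clause per character of the new word, because a word missing from the dictionary was indexed character by character. A minimal sketch of generating that body for any new word, assuming the same `name` field; the helper name `build_reindex_query` is hypothetical and not part of Elasticsearch or IK:

```python
import json


def build_reindex_query(word: str, field: str = "name") -> dict:
    """Build an _update_by_query body that matches documents whose `field`
    contains every individual character of `word` as a separate term."""
    if not word:
        raise ValueError("word must not be empty")
    # One term clause per distinct character, keeping first-seen order.
    chars = list(dict.fromkeys(word))
    return {
        "query": {
            "bool": {
                "must": [{"term": {field: ch}} for ch in chars]
            }
        }
    }


# Prints the JSON body to send with POST /shop/_update_by_query.
print(json.dumps(build_reindex_query("海底捞"), ensure_ascii=False, indent=2))
```

This only covers the case where the old analysis produced single-character terms; words that were previously split into longer fragments would need different term clauses.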
Dynamic update
vi config/analysis-ik/IKAnalyzer.cfg.xml
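# <entry key="remote_ext_dict">http://yoursite.com/getCustomDict</entry>
The HTTP response must carry two headers, Last-Modified and ETag. The analyzer polls the URL once per minute; when either header value changes, the plugin fetches the new words and reloads its dictionary. Documents that are already indexed are not re-analyzed automatically.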
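A minimal sketch of such an endpoint using only the Python standard library. It assumes the plugin expects a UTF-8 body with one word per line; the word list, the timestamp, port 8000, and the names `build_response`, `DictHandler` and `serve` are all hypothetical.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory word list; a real service would read a file or database.
WORDS = ["海底捞", "西红柿炒鸡蛋", "青岛啤酒"]
LAST_MODIFIED = 1_700_000_000  # hypothetical Unix time of the last dictionary change


def build_response(words, mtime):
    """Return (headers, body) for the dictionary: one word per line, UTF-8."""
    body = ("\n".join(words) + "\n").encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        "Last-Modified": formatdate(mtime, usegmt=True),
        # ETag is derived from the content, so it changes whenever the words change.
        "ETag": '"' + hashlib.md5(body).hexdigest() + '"',
    }
    return headers, body


class DictHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        headers, body = build_response(WORDS, LAST_MODIFIED)
        self.send_response(200)
        for key, value in headers.items():
            self.send_header(key, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        # Answer HEAD as well, so a poller can inspect the headers alone.
        self._send(include_body=False)


def serve(port=8000):
    """Start the server and block until interrupted."""
    HTTPServer(("0.0.0.0", port), DictHandler).serve_forever()
```

Calling `serve()` starts the endpoint; editing `WORDS` changes the ETag, and bumping `LAST_MODIFIED` changes Last-Modified, either of which should be enough for the next poll to pick up the new words.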
Synonyms
- Add a synonym file
cat > config/analysis-ik/new_synonym.dic << EOF
西红柿炒鸡蛋,番茄炒鸡蛋,番茄炒蛋
炖菜,烩菜
EOF
- Recreate the index with a synonym filter
PUT /shop?include_type_name=false
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms_path": "analysis-ik/new_synonym.dic"
}
},
"analyzer": {
"ik_syno": {
"type": "custom",
"tokenizer": "ik_smart",
"filter": ["my_synonym_filter"]
},
"ik_syno_max": {
"type": "custom",
"tokenizer": "ik_max_word",
"filter": ["my_synonym_filter"]
}
}
}
},
"mappings": {
"properties": {
"id": {"type": "integer"},
"name": {
"type": "text",
"analyzer": "ik_syno_max",
"search_analyzer": "ik_syno"
},
"tags": {
"type": "text",
"analyzer": "whitespace",
"fielddata": true
},
"location": {"type": "geo_point"},
"remark_score": {"type": "double"},
"price_per_man": {"type": "integer"},
"category_id": {"type": "integer"},
"category_name": {"type": "keyword"},
"seller_id": {"type": "integer"},
"seller_remark_score": {"type": "double"},
"seller_disabled_flag": {"type": "integer"}
}
}
}
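Each line of the synonym file is a comma-separated group of equivalent terms. The sketch below only illustrates that grouping and is not how Elasticsearch implements the filter. It ignores explicit `=>` mappings, and `parse_synonyms` is a hypothetical helper.

```python
def parse_synonyms(text: str) -> dict:
    """Map every term to the full set of terms sharing its line.
    Blank lines and lines starting with '#' are skipped."""
    lookup = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        group = {term.strip() for term in line.split(",") if term.strip()}
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup


SYNONYMS = """\
西红柿炒鸡蛋,番茄炒鸡蛋,番茄炒蛋
炖菜,烩菜
"""

lookup = parse_synonyms(SYNONYMS)
# A search for 烩菜 is expected to match documents containing 炖菜 as well.
print(sorted(lookup["烩菜"]))
```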
Reshaping relevance
GET /shop/_search
{
"_source": "*",
"script_fields": {
"distance": {
"script": {
"source": "haversin(lat, lon, doc['location'].lat, doc['location'].lon)",
"lang": "expression",
"params": {"lat": 39.90, "lon": 116.38}
}
}
},
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{"match": {"name": {"query": "炖菜", "boost": 0.1}}}, // boost: score weight of this clause
{"term": {"category_id": {"value": 1, "boost": 0.1}}} // affects recall
]
}
},{
"term": {"seller_disabled_flag": 0}
}
]
}
},
"functions": [
{
"gauss": { // decay functions: gauss (Gaussian), exp (exponential), linear
"location": {
"origin": "39.90,116.38",
"offset": "7km",
"scale": "13km",
"decay": 0.4
}
},
"weight": 2.5
},{
"field_value_factor": {"field": "remark_score"},
"weight": 0.2
},{
"field_value_factor": {"field": "seller_remark_score"},
"weight": 0.1
},{
"filter": {"term": {"category_id": 1}}, // affects ranking
"weight": 0.2
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"sort": [
{"_score": {"order": "desc"}}
],
"aggs": {
"group_by_tags": {
"terms": {"field": "tags"}
}
}
}
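To get a feel for the numbers in the query above, here is a sketch of the Gaussian decay as the Elasticsearch documentation defines it: the value is 1 within `offset` and falls to `decay` at `offset + scale`. It also shows how `score_mode: sum` and `boost_mode: sum` combine the parts. The shop values in the example are hypothetical, and the function names are illustrative only.

```python
import math


def gauss_decay(distance_km, offset_km=7.0, scale_km=13.0, decay=0.4):
    """Gaussian decay: 1.0 inside the offset, `decay` at offset + scale."""
    sigma_sq = -(scale_km ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, distance_km - offset_km)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


def combined_score(query_score, weighted_functions):
    """score_mode=sum adds up weight * value over the functions;
    boost_mode=sum then adds that total to the query score."""
    return query_score + sum(weight * value for weight, value in weighted_functions)


for d in (0, 7, 20, 40):
    print(f"{d:>3} km -> decay {gauss_decay(d):.3f}")

# Hypothetical shop: 20 km away, remark_score 4.5, seller_remark_score 4.0,
# category_id 1, and a query score of 0.2.
score = combined_score(0.2, [
    (2.5, gauss_decay(20)),  # distance decay
    (0.2, 4.5),              # field_value_factor on remark_score
    (0.1, 4.0),              # field_value_factor on seller_remark_score
    (0.2, 1.0),              # filter-only function contributes its weight
])
print(f"combined score: {score:.2f}")
```

With these weights a shop at 20 km (offset + scale) still receives 2.5 × 0.4 = 1.0 from the distance function, which outweighs the 0.1 boosts on the text clauses, so ranking here is driven mostly by distance and ratings.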
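Recall and precision
Recall: if 100 documents satisfy the condition but the query returns only 40 of them, recall is just 40%.
Precision: if the query returns 40 documents and only 20 of them are correct, precision is just 50%.
Note: the two generally trade off against each other. Dropping the recall clause raises precision but lowers recall; dropping the ranking function raises recall but lowers precision. One option is to run the ranking-only query first and, if the results are unsatisfactory, run it again with the recall clause added.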
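The two definitions can be checked with a few lines of Python. This is a sketch using made-up document IDs; `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs
    against the set of IDs that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = range(100)  # 100 documents satisfy the condition

# The query returns 40 of the 100 relevant documents.
print(precision_recall(range(40), relevant))

# The query returns 40 documents, only 20 of which are relevant.
mixed = list(range(20)) + list(range(1000, 1020))
print(precision_recall(mixed, relevant))
```

The first call gives a recall of 0.4 and the second a precision of 0.5, matching the two examples above.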