txtchat.data.wikipedia.index crashes when dataset only has one data arrow file

#3
by ymcki - opened

I downloaded a Wikipedia dump for the zh-yue language. It is only 116MB, and the resulting dataset has only one Arrow data file.
https://dumps.wikimedia.org/zh_yuewiki/20241201/

txtchat.data.wikipedia.index crashes on this dataset. For zh and ja, which have multiple Arrow data files, it worked. It would be great if this could be fixed for small languages as well.

time python3 -m txtchat.data.wikipedia.index -d wikipedia-zh_yue-20170720 -o txtai-zh_yue-wikipedia -v pageviews/pageviews-zh_yue.sqlite
Process Process-1:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ai/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/ai/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ai/lib/python3.10/site-packages/txtchat/data/wikipedia/index.py", line 55, in __call__
    title = row["title"]
KeyError: 'title'
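
The KeyError means the rows reaching index.py have no "title" field. To narrow this down, a quick check like the one below may help. It is only a sketch and not part of txtchat: the helper names `missing_fields` and `check_rows` are made up for illustration, and it assumes the rows can be read as dict-like records. The only field it checks by default is "title", since that is the one the traceback shows being read.

```python
from typing import Iterable, List, Mapping, Sequence, Tuple

# Hypothetical default: "title" is the only field the traceback shows being read.
REQUIRED_FIELDS: Tuple[str, ...] = ("title",)


def missing_fields(row: Mapping, required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required field names that are absent from one row."""
    return [name for name in required if name not in row]


def check_rows(
    rows: Iterable[Mapping],
    required: Sequence[str] = REQUIRED_FIELDS,
    limit: int = 5,
) -> List[Tuple[int, List[str], List[str]]]:
    """Inspect at most `limit` rows and report the ones lacking required fields.

    Each entry is (row position, missing field names, fields actually present).
    """
    problems = []
    for position, row in enumerate(rows):
        if position >= limit:
            break
        missing = missing_fields(row, required)
        if missing:
            problems.append((position, missing, sorted(row.keys())))
    return problems


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the report.
    sample = [{"title": "A", "text": "x"}, {"id": "1", "text": "y"}]
    for position, missing, present in check_rows(sample):
        print(f"row {position}: missing {missing}, has {present}")
```

Running `check_rows` over the first few records of the dataset passed with -d would show which fields the single-file dataset actually exposes, which should make it clearer whether the data or the loader is at fault.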
