freemt committed
Commit 601d149 · Parent(s): c90742c
Update requirements.txt

Files changed:
- data/test_en.txt +69 -0
- data/test_zh.txt +74 -0
- gradiobee/__init__.py +1 -0
- gradiobee/__pycache__/__init__.cpython-37.pyc +0 -0
- gradiobee/__pycache__/__init__.cpython-38.pyc +0 -0
- gradiobee/__pycache__/cmat2tset.cpython-37.pyc +0 -0
- gradiobee/__pycache__/cmat2tset.cpython-38.pyc +0 -0
- gradiobee/__pycache__/docterm_scores.cpython-37.pyc +0 -0
- gradiobee/__pycache__/docterm_scores.cpython-38.pyc +0 -0
- gradiobee/__pycache__/en2zh.cpython-37.pyc +0 -0
- gradiobee/__pycache__/en2zh.cpython-38.pyc +0 -0
- gradiobee/__pycache__/en2zh_tokens.cpython-37.pyc +0 -0
- gradiobee/__pycache__/en2zh_tokens.cpython-38.pyc +0 -0
- gradiobee/__pycache__/gen_model.cpython-37.pyc +0 -0
- gradiobee/__pycache__/gen_model.cpython-38.pyc +0 -0
- gradiobee/__pycache__/insert_spaces.cpython-37.pyc +0 -0
- gradiobee/__pycache__/insert_spaces.cpython-38.pyc +0 -0
- gradiobee/__pycache__/mdx_e2c.cpython-37.pyc +0 -0
- gradiobee/__pycache__/mdx_e2c.cpython-38.pyc +0 -0
- gradiobee/__pycache__/plot_df.cpython-38.pyc +0 -0
- gradiobee/__pycache__/smatrix.cpython-37.pyc +0 -0
- gradiobee/__pycache__/smatrix.cpython-38.pyc +0 -0
- gradiobee/cmat2tset.py +59 -0
- gradiobee/docterm_scores.py +96 -0
- gradiobee/en2zh.py +40 -0
- gradiobee/en2zh_tokens.py +28 -0
- gradiobee/gen_model.py +115 -0
- gradiobee/insert_spaces.py +14 -0
- gradiobee/mdx_dict_e2c.lzma +0 -0
- gradiobee/mdx_e2c.py +40 -0
- gradiobee/plot_df.py +98 -0
- gradiobee/smatrix.py +100 -0
- pyrightconfig.json +11 -0
- requirements.txt +7 -1
data/test_en.txt
ADDED
@@ -0,0 +1,69 @@
+Wuthering Heights
+
+
+--------------------------------------------------------------------------------
+
+Chapter 2
+
+Chinese
+
+
+Yesterday afternoon set in misty and cold. I had half a mind to spend it by my study fire, instead of wading through heath and mud to Wuthering Heights. On coming up from dinner, however (N.B. I dine between twelve and one o'clock; the housekeeper, a matronly lady, taken as a fixture along with the house, could not, or would not, comprehend my request that I might be served at five), on mounting the stairs with this lazy intention, and stepping into the room, I saw a servant girl on her knees surrounded by brushes and coal-scuttles, and raising an infernal dust as she extinguished the flames with heaps of cinders. This spectacle drove me back immediately; I took my hat, and, after a four-miles' walk, arrived at Heathcliff's garden gate just in time to escape the first feathery flakes of a snow shower.
+
+On that bleak hill top the earth was hard with a black frost, and the air made me shiver through every limb. Being unable to remove the chain, I jumped over, and, running up the flagged causeway bordered with straggling gooseberry bushes, knocked vainly for admittance, till my knuckles tingled and the dogs howled.
+
+`Wretched inmates!' I ejaculated mentally, `you deserve perpetual isolation from your species for your churlish inhospitality. At least, I would not keep my doors barred in the day time. I don't care--I will get in!' So resolved, I grasped the latch and shook it vehemently. Vinegar-faced Joseph projected his head from a round window of the barn.
+
+`Whet are ye for?' he shouted. `T' maister's dahn i' t' fowld. Go rahnd by th' end ut' laith, if yah went tuh spake tull him.'
+
+`Is there nobody inside to open the door?' I hallooed, responsively.
+
+`They's nobbut t' missis; and shoo'll nut oppen't an ye mak yer flaysome dins till neeght.'
+
+`Why? Cannot you tell her who I am, eh, Joseph?'
+
+`Nor-ne me! Aw'll hae noa hend wi't,' muttered the head, vanishing.
+
+The snow began to drive thickly. I seized the handle to essay another trial; when a young man without coat, and shouldering a pitchfork, appeared in the yard behind. He hailed me to follow him, and, after marching through a wash-house, and a paved area containing a coal shed, pump, and pigeon cot, we at length arrived in the huge, warm, cheerful apartment, where I was formerly received. It glowed delightfully in the radiance of an immense fire, compounded of coal, peat, and wood; and near the table, laid for a plentiful evening meal, I was pleased to observe the `missis', an individual whose existence I had never previously suspected. I bowed and waited, thinking she would bid me take a seat. She looked at me, leaning back in her chair, and remained motionless and mute.
+
+`Rough weather!' I remarked. `I'm afraid, Mrs Heathcliff, the door must bear the consequence of your servants' leisure attendance: I had hard work to make them hear me.'
+
+She never opened her mouth. I stared--she stared also: at any rate, she kept her eyes on me in a cool, regardless manner, exceedingly embarrassing and disagreeable.
+
+`Sit down,' said the young man gruffly. `He'll be in soon.'
+
+I obeyed; and hemmed, and called the villain Juno, who deigned, at this second interview, to move the extreme tip of her tail, in token of owning my acquaintance.
+
+`A beautiful animal!' I commenced again. `Do you intend parting with the little ones, madam?'
+
+`They are not mine,' said the amiable hostess, more repellingly than Heathcliff himself could have replied.
+
+`Ah, your favourites are among these?' I continued, turning to an obscure cushion full of something like cats.
+
+`A strange choice of favourites!' she observed scornfully.
+
+Unluckily, it was a heap of dead rabbits. I hemmed once more, and drew closer to the hearth, repeating my comment on the wildness of the evening.
+
+`You should not have come out,' she said, rising and reaching from the chimney-piece two of the painted canisters.
+
+Her position before was sheltered from the light; now, I had a distinct view of her whole figure and countenance. She was slender, and apparently scarcely past girlhood: an admirable form, and the most exquisite little face that I have ever had the pleasure of beholding; small features, very fair; flaxen ringlets, or rather golden, hanging loose on her delicate neck; and eyes, had they been agreeable in expression, they would have been irresistible: fortunately for my susceptible heart, the only sentiment they evinced hovered between scorn, and a kind of desperation, singularly unnatural to be detected there. The canisters were almost out of her reach; I made a motion to aid her; she turned upon me as a miser might turn if anyone attempted to assist him in counting his gold.
+
+`I don't want your help,' she snapped; `I can get them for myself.'
+
+`I beg your pardon!' I hastened to reply.
+
+`Were you asked to tea?' she demanded, tying an apron over her neat black frock, and standing with a spoonful of the leaf poised over the pot.
+
+`I shall be glad to have a cup,' I answered.
+
+`Were you asked?' she repeated.
+
+`No,' I said, half smiling. `You are the proper person to ask me.'
+
+
+
+Contents PreviousChapter
+NextChapter
+
+
+Homepage

data/test_zh.txt
ADDED
@@ -0,0 +1,74 @@
+呼啸山庄
+
+--------------------------------------------------------------------------------
+
+第二章
+
+英文
+
+
+昨天下午又冷又有雾。我想就在书房炉边消磨一下午，不想踩着杂草污泥到呼啸山庄了。
+
+但是，吃过午饭(注意——我在十二点与一点钟之间吃午饭，而可以当作这所房子的附属物的管家婆，一位慈祥的太太却不能，或者并不愿理解我请求在五点钟开饭的用意)，在我怀着这个懒惰的想法上了楼，迈进屋子的时候，看见一个女仆跪在地上，身边是扫帚和煤斗。她正在用一堆堆煤渣封火，搞起一片弥漫的灰尘。这景象立刻把我赶回头了。我拿了帽子，走了四里路，到达了希刺克厉夫的花园口口，刚好躲过了一场今年初降的鹅毛大雪。
+
+在那荒凉的山顶上，土地由于结了一层黑冰而冻得坚硬，冷空气使我四肢发抖。我弄不开门链，就跳进去，顺着两边种着蔓延的醋栗树丛的石路跑去。我白白地敲了半天门，一直敲到我的手指骨都痛了，狗也狂吠起来。
+
+“倒霉的人家！”我心里直叫，“只为你这样无礼待客，就该一辈子跟人群隔离。我至少还不会在白天把门闩住。我才不管呢——我要进去！”如此决定了。我就抓住门闩，使劲摇它。苦脸的约瑟夫从谷仓的一个圆窗里探出头来。
+
+“你干吗？”他大叫。“主人在牛栏里，你要是找他说话，就从这条路口绕过去。”
+
+“屋里没人开门吗？”我也叫起来。
+
+“除了太太没有别人。你就是闹腾到夜里，她也不会开。”
+
+“为什么？你就不能告诉她我是谁吗，呃，约瑟夫？”
+
+“别找我！我才不管这些闲事呢，”这个脑袋咕噜着，又不见了。
+
+雪开始下大了。我握住门柄又试一回。这时一个没穿外衣的年轻人，扛着一根草耙，在后面院子里出现了。他招呼我跟着他走，穿过了一个洗衣房和一片铺平的地，那儿有煤棚、抽水机和鸽笼，我们终于到了我上次被接待过的那间温暖的、热闹的大屋子。煤、炭和木材混合在一起燃起的熊熊炉火，使这屋子放着光彩。在准备摆上丰盛晚餐的桌旁，我很高兴地看到了那位“太太”，以前我从未料想到会有这么一个人存在的。我鞠躬等候，以为她会叫我坐下。她望望我，往她的椅背一靠，不动，也不出声。
+
+“天气真坏！”我说，“希刺克厉夫太太，恐怕大门因为您的仆人偷懒而大吃苦头，我费了好大劲才使他们听见我敲门！”
+
+她死不开口。我瞪眼——她也瞪眼。反正她总是以一种冷冷的、漠不关心的神气盯住我，使人十分窘，而且不愉快。
+
+“坐下吧，”那年轻人粗声粗气地说，“他就要来了。”
+
+我服从了；轻轻咳了一下，叫唤那恶狗朱诺。临到第二次会面，它总算赏脸，摇起尾巴尖，表示认我是熟人了。
+
+“好漂亮的狗！”我又开始说话。“您是不是打算不要这些小的呢，夫人？”
+
+“那些不是我的，”这可爱可亲的女主人说，比希刺克厉夫本人所能回答的腔调还要更冷淡些。
+
+“啊，您所心爱的是在这一堆里啦！”我转身指着一个看不清楚的靠垫上那一堆像猫似的东西，接着说下去。
+
+“谁会爱这些东西那才怪呢！”她轻蔑地说。
+
+倒霉，原来那是堆死兔子。我又轻咳一声，向火炉凑近些，又把今晚天气不好的话评论一通。
+
+“你本来就不该出来。”她说，站起来去拿壁炉台上的两个彩色茶叶罐。
+
+她原先坐在光线被遮住的地方，现在我把她的全身和面貌都看得清清楚楚。她苗条，显然还没有过青春期。挺好看的体态，还有一张我生平从未有幸见过的绝妙的小脸蛋。五官纤丽，非常漂亮。淡黄色的卷发，或者不如说是金黄色的，松松地垂在她那细嫩的颈上。至于眼睛，要是眼神能显得和悦些，就要使人无法抗拒了。对我这容易动情的心说来倒是常事，因为它们所表现的只是在轻蔑与近似绝望之间的一种情绪，而在那张脸上看见那样的眼神是特别不自然的。
+
+她简直够不到茶叶罐。我动了一动，想帮她一下。她猛地扭转身向我，像守财奴看见别人打算帮他数他的金子一样。
+
+“我不要你帮忙，”她怒气冲冲地说，“我自己拿得到。”
+
+“对不起！”我连忙回答。
+
+“是请你来吃茶的吗？”她问，把一条围裙系在她那干净的黑衣服上，就这样站着，拿一匙茶叶正要往茶壶里倒。
+
+“我很想喝杯茶。”我回答。
+
+“是请你来的吗？”她又问。
+
+“没有，”我说，勉强笑一笑。“您正好请我喝茶。”
+
+
+
+
+目录
+上一章
+下一章
+
+
+返回首页

gradiobee/__init__.py
ADDED
@@ -0,0 +1 @@
+"""Init."""

gradiobee/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (179 Bytes)

gradiobee/__pycache__/__init__.cpython-38.pyc
ADDED
Binary file (131 Bytes)

gradiobee/__pycache__/cmat2tset.cpython-37.pyc
ADDED
Binary file (1.56 kB)

gradiobee/__pycache__/cmat2tset.cpython-38.pyc
ADDED
Binary file (1.51 kB)

gradiobee/__pycache__/docterm_scores.cpython-37.pyc
ADDED
Binary file (2.64 kB)

gradiobee/__pycache__/docterm_scores.cpython-38.pyc
ADDED
Binary file (2.61 kB)

gradiobee/__pycache__/en2zh.cpython-37.pyc
ADDED
Binary file (989 Bytes)

gradiobee/__pycache__/en2zh.cpython-38.pyc
ADDED
Binary file (947 Bytes)

gradiobee/__pycache__/en2zh_tokens.cpython-37.pyc
ADDED
Binary file (1.1 kB)

gradiobee/__pycache__/en2zh_tokens.cpython-38.pyc
ADDED
Binary file (1.06 kB)

gradiobee/__pycache__/gen_model.cpython-37.pyc
ADDED
Binary file (4.7 kB)

gradiobee/__pycache__/gen_model.cpython-38.pyc
ADDED
Binary file (4.67 kB)

gradiobee/__pycache__/insert_spaces.cpython-37.pyc
ADDED
Binary file (646 Bytes)

gradiobee/__pycache__/insert_spaces.cpython-38.pyc
ADDED
Binary file (602 Bytes)

gradiobee/__pycache__/mdx_e2c.cpython-37.pyc
ADDED
Binary file (945 Bytes)

gradiobee/__pycache__/mdx_e2c.cpython-38.pyc
ADDED
Binary file (901 Bytes)

gradiobee/__pycache__/plot_df.cpython-38.pyc
ADDED
Binary file (2.33 kB)

gradiobee/__pycache__/smatrix.cpython-37.pyc
ADDED
Binary file (2.68 kB)

gradiobee/__pycache__/smatrix.cpython-38.pyc
ADDED
Binary file (2.65 kB)

gradiobee/cmat2tset.py
ADDED
@@ -0,0 +1,59 @@
+"""Gen triple-set from a matrix."""
+from typing import List, Tuple, Union
+
+import numpy as np
+import pandas as pd
+
+
+# fmt: off
+def cmat2tset(
+        cmat1: Union[List[List[float]], np.ndarray, pd.DataFrame],
+        # thirdcol: bool = True
+# ) -> List[Union[Tuple[int, int], Tuple[int, int, float]]]:
+) -> np.ndarray:
+    # fmt: on
+    """Gen triple-set from a matrix.
+
+    Args
+        cmat1: 2d-array or list, correlation or other metric matrix
+        # thirdcol: bool, whether to output a third column (max value)
+
+    Returns
+        The max and argmax for each column; the winning row and column are
+        erased after each pick so that no single row can dominate every column.
+    """
+    # if isinstance(cmat, list):
+    cmat = np.array(cmat1)
+
+    if not np.prod(cmat.shape):
+        raise SystemError("data not 2d...")
+
+    _ = """
+    # y00 = range(cmat.shape[1])  # cmat.shape[0]: long time-wasting bug
+
+    yargmax = cmat.argmax(axis=0)
+    if thirdcol:
+        ymax = cmat.max(axis=0)
+
+        res = [*zip(y00, yargmax, ymax)]  # type: ignore
+        # to unzip
+        # a, b, c = zip(*res)
+
+        return res
+
+    _ = [*zip(y00, yargmax)]  # type: ignore
+    return _
+    """
+    low_ = cmat.min() - 1
+    argmax_max = []
+    src_len, tgt_len = cmat.shape
+    for _ in range(min(src_len, tgt_len)):
+        argmax = int(cmat.argmax())
+        row, col = divmod(argmax, tgt_len)
+        argmax_max.append([col, row, cmat.max()])
+
+        # erase the row-th row and col-th col of cmat
+        cmat[row, :] = low_
+        cmat[:, col] = low_
+
+    return np.array(argmax_max)
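
A minimal usage sketch for the `cmat2tset` added above (toy 2x3 matrix; the function copies its input via `np.array`, so the greedy argmax-and-erase loop leaves the caller's matrix untouched):

    import numpy as np
    from gradiobee.cmat2tset import cmat2tset

    cmat = [[0.1, 0.9, 0.2],
            [0.8, 0.3, 0.4]]
    tset = cmat2tset(cmat)
    # picks 0.9 first (col 1, row 0), erases that row and column, then picks 0.8:
    print(tset)  # [[1.  0.  0.9]
                 #  [0.  1.  0.8]]
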
gradiobee/docterm_scores.py
ADDED
@@ -0,0 +1,96 @@
+"""Generate a doc-term score matrix based on textacy.representations.Vectorizer.
+
+refer also to fast-scores fast_scores.py and gen_model.py (sklearn.feature_extraction.text.TfidfVectorizer).
+"""
+from typing import Dict, Iterable, List, Optional, Union
+import numpy as np
+from itertools import chain
+from psutil import virtual_memory
+from more_itertools import ilen
+
+from textacy.representations import Vectorizer
+from logzero import logger
+
+from gradiobee.gen_model import gen_model
+
+
+# fmt: off
+def docterm_scores(
+        doc1: Iterable[Iterable[str]],  # List[List[str]],
+        doc2: Iterable[Iterable[str]],
+        model: Vectorizer = None,
+        tf_type: str = 'linear',
+        idf_type: Optional[str] = "smooth",
+        # dl_type: Optional[str] = "sqrt",  # "lucene-style tfidf"
+        dl_type: Optional[str] = None,
+        norm: Optional[str] = "l2",  # + "l2"
+        min_df: Union[int, float] = 1,
+        max_df: Union[int, float] = 1.0,
+        max_n_terms: Optional[int] = None,
+        vocabulary_terms: Optional[Union[Dict[str, int], Iterable[str]]] = None
+) -> np.ndarray:
+    # fmt: on
+    """Generate a doc-term score matrix based on textacy.representations.Vectorizer.
+
+    Args
+        doc1: tokenized doc of n1
+        doc2: tokenized doc of n2
+        model: if None, generate one ad hoc from doc1 and doc2 ("lucene-style tfidf").
+        rest: refer to textacy.representations.Vectorizer
+    Attributes
+        model: the fitted vectorizer
+
+    Returns
+        n2 x n1 similarity matrix of float numbers (dt2 . dt1.T)
+    """
+    # make sure doc1/doc2 is of the right typing
+    try:
+        for xelm in iter(doc1):
+            for elm in iter(xelm):
+                assert isinstance(elm, str)
+    except AssertionError:
+        raise AssertionError(" doc1 is not of the typing Iterable[Iterable[str]] ")
+    except Exception as e:
+        logger.error(e)
+        raise
+    try:
+        for xelm in iter(doc2):
+            for elm in iter(xelm):
+                assert isinstance(elm, str)
+    except AssertionError:
+        raise AssertionError(" doc2 is not of the typing Iterable[Iterable[str]] ")
+    except Exception as e:
+        logger.error(e)
+        raise
+
+    if model is None:
+        model = gen_model(
+            [*chain(doc1, doc2)],
+            tf_type=tf_type,
+            idf_type=idf_type,
+            dl_type=dl_type,
+            norm=norm,
+            min_df=min_df,
+            max_df=max_df,
+            max_n_terms=max_n_terms,
+            vocabulary_terms=vocabulary_terms
+        )
+    docterm_scores.model = model
+
+    # a1 = dt.toarray(), a2 = doc_term_matrix.toarray()
+    # np.all(np.isclose(a1, a2))
+
+    dt1 = model.transform(doc1)
+    dt2 = model.transform(doc2)
+
+    # each matrix entry is a 64-bit float: 8 bytes
+    require_ram = ilen(iter(doc1)) * ilen(iter(doc2)) * 8
+    if require_ram > virtual_memory().available:
+        logger.warning("virtual_memory().available: %s", virtual_memory().available)
+        logger.warning("memory required: %s", require_ram)
+
+    if require_ram > virtual_memory().available * 10:
+        logger.warning("You'll likely encounter memory problems, such as slowed response and/or OOM.")
+
+    # return dt1.dot(dt2.T)
+    return dt2.toarray().dot(dt1.toarray().T)
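
A usage sketch for `docterm_scores` with made-up token lists. Note the orientation: despite talk of "n1 x n2", the last line computes `dt2.dot(dt1.T)`, so the result is n2 x n1:

    from gradiobee.docterm_scores import docterm_scores

    doc1 = [["i", "love", "cats"], ["dogs", "bark"]]        # n1 = 2 docs
    doc2 = [["cats", "purr"], ["dogs", "bark"], ["birds"]]  # n2 = 3 docs
    smat = docterm_scores(doc1, doc2)  # fits a vectorizer ad hoc from both docs
    print(smat.shape)  # (3, 2)
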
gradiobee/en2zh.py
ADDED
@@ -0,0 +1,40 @@
+"""Translate English to Chinese via a dict."""
+from typing import List, Union
+
+import warnings
+
+import copy
+from gradiobee.mdx_e2c import mdx_e2c
+
+warnings.simplefilter('ignore', DeprecationWarning)
+
+
+# fmt: off
+def en2zh(
+        # text: Union[str, List[List[str]]],
+        text: Union[str, List[str]],
+) -> List[str]:
+    # fmt: on
+    """Translate English to Chinese via a dict.
+
+    Args
+        text: to translate, str or list of str
+
+    Returns
+        res: list of str
+    """
+    res = copy.deepcopy(text)
+    if isinstance(text, str):
+        # res = [text.split()]
+        res = [text]
+
+    # if res and isinstance(res[0], str):
+    #     res = [line.lower().split() for line in res]
+
+    # res = ["".join([word_tr(word) for word in line]) for line in res]
+    _ = []
+    for line in res:
+        line_tr = [mdx_e2c(word) for word in line.split()]
+        _.append("".join(line_tr))
+
+    return _
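
A quick sketch of `en2zh` (word-for-word dictionary lookup; the actual output depends on the bundled mdx dictionary, so no concrete translation is shown):

    from gradiobee.en2zh import en2zh

    print(en2zh("I love cats"))                  # one str in -> list with one line
    print(en2zh(["hello world", "good night"]))  # one translated line per input line
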
gradiobee/en2zh_tokens.py
ADDED
@@ -0,0 +1,28 @@
+"""Translate English to Chinese via a dict."""
+from typing import List, Union
+
+from gradiobee.en2zh import en2zh
+from gradiobee.insert_spaces import insert_spaces
+
+
+# fmt: off
+def en2zh_tokens(
+        # text: Union[str, List[List[str]]],
+        text: Union[str, List[str]],
+        dedup: bool = True,
+) -> List[List[str]]:
+    # fmt: on
+    """Translate English to Chinese tokens via a dict.
+
+    Args
+        text: to translate, str or list of str
+        dedup: if True, remove all duplicate tokens
+    Returns
+        res: list of list of str/token/char
+    """
+    res = en2zh(text)
+
+    if dedup:
+        return [list(set(insert_spaces(elm).split())) for elm in res]
+
+    return [insert_spaces(elm).split() for elm in res]
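
And the tokenized variant; with `dedup=True` the token order within each line is not preserved, since the list round-trips through a set:

    from gradiobee.en2zh_tokens import en2zh_tokens

    tokens = en2zh_tokens("I love cats")
    # a list with one inner list: the deduplicated characters of the dictionary
    # entries for "i", "love", "cats" -- contents depend on the bundled mdx dict
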
gradiobee/gen_model.py
ADDED
@@ -0,0 +1,115 @@
+"""Generate a model (textacy.representations.Vectorizer).
+
+vectorizer = Vectorizer(
+    tf_type="linear", idf_type="smooth", norm="l2",
+    min_df=3, max_df=0.95)
+doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
+doc_term_matrix
+
+tokenized_docs = [insert_spaces(elm).split() for elm in textzh]
+"""
+from typing import Dict, Iterable, List, Optional, Union
+
+from textacy.representations import Vectorizer
+from logzero import logger
+
+
+# fmt: off
+def gen_model(
+        tokenized_docs: Iterable[Iterable[str]],  # List[List[str]],
+        tf_type: str = 'linear',
+        idf_type: Optional[str] = "smooth",
+        dl_type: Optional[str] = None,  # "sqrt" for "lucene-style tfidf"
+        norm: Optional[str] = "l2",  # + "l2"
+        min_df: Union[int, float] = 1,
+        max_df: Union[int, float] = 1.0,
+        max_n_terms: Optional[int] = None,
+        vocabulary_terms: Optional[Union[Dict[str, int], Iterable[str]]] = None
+) -> Vectorizer:
+    # fmt: on
+    """Generate a model (textacy.representations.Vectorizer).
+
+    Args:
+        tokenized_docs: tokenized docs
+
+        (refer to textacy.representations.Vectorizer)
+        tf_type: Type of term frequency (tf) to use for weights' local component:
+
+            - "linear": tf (tfs are already linear, so left as-is)
+            - "sqrt": tf => sqrt(tf)
+            - "log": tf => log(tf) + 1
+            - "binary": tf => 1
+
+        idf_type: Type of inverse document frequency (idf) to use for weights'
+            global component:
+
+            - "standard": idf = log(n_docs / df) + 1.0
+            - "smooth": idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added
+              to all document frequencies, as if a single document containing
+              every unique term was added to the corpus.
+            - "bm25": idf = log((n_docs - df + 0.5) / (df + 0.5)), which is
+              a form commonly used in information retrieval that allows for
+              very common terms to receive negative weights.
+            - None: no global weighting is applied to local term weights.
+
+        dl_type: Type of document-length scaling to use for weights'
+            normalization component:
+
+            - "linear": dl (dls are already linear, so left as-is)
+            - "sqrt": dl => sqrt(dl)
+            - "log": dl => log(dl)
+            - None: no normalization is applied to local (* global?) weights
+
+        norm: If "l1" or "l2", normalize weights by the L1 or L2 norms, respectively,
+            of row-wise vectors; otherwise, don't.
+        min_df: Minimum number of documents in which a term must appear for it to be
+            included in the vocabulary and as a column in a transformed doc-term matrix.
+            If float, value is the fractional proportion of the total number of docs,
+            which must be in [0.0, 1.0]; if int, value is the absolute number.
+        max_df: Maximum number of documents in which a term may appear for it to be
+            included in the vocabulary and as a column in a transformed doc-term matrix.
+            If float, value is the fractional proportion of the total number of docs,
+            which must be in [0.0, 1.0]; if int, value is the absolute number.
+        max_n_terms: If specified, only include terms whose document frequency is within
+            the top ``max_n_terms``.
+        vocabulary_terms: Mapping of unique term string to unique term id, or
+            an iterable of term strings that gets converted into such a mapping.
+            Note that, if specified, vectorized outputs will include *only* these terms.
+
+        "lucene-style tfidf": Adds a doc-length normalization to the usual local and global components.
+            Params: tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="sqrt"
+
+        "lucene-style bm25": Uses a smoothed idf instead of the classic bm25 variant to prevent weights on terms from going negative.
+            Params: tf_type="bm25", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="linear"
+    Attributes:
+        doc_term_matrix
+    Returns:
+        the fit_transform'ed vectorizer
+    """
+    # make sure tokenized_docs is of the right typing
+    try:
+        for xelm in iter(tokenized_docs):
+            for elm in iter(xelm):
+                assert isinstance(elm, str)
+    except AssertionError:
+        raise AssertionError(" tokenized_docs is not of the typing Iterable[Iterable[str]] ")
+    except Exception as e:
+        logger.error(e)
+        raise
+
+    vectorizer = Vectorizer(
+        # tf_type="linear", idf_type="smooth", norm="l2", min_df=3, max_df=0.95
+        tf_type=tf_type,
+        idf_type=idf_type,
+        dl_type=dl_type,
+        norm=norm,
+        min_df=min_df,
+        max_df=max_df,
+        max_n_terms=max_n_terms,
+        vocabulary_terms=vocabulary_terms
+    )
+    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
+
+    gen_model.doc_term_matrix = doc_term_matrix
+
+    return vectorizer
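
A minimal sketch of fitting the vectorizer with `gen_model` on toy tokenized docs (assuming textacy's Vectorizer exposes the `terms_list` accessor for the learned vocabulary, as recent versions do; the doc-term matrix is stashed on the function object, as the code above shows):

    from gradiobee.gen_model import gen_model

    tokenized_docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
    vec = gen_model(tokenized_docs)
    print(vec.terms_list)                   # learned vocabulary, e.g. ['a', 'b', 'c']
    print(gen_model.doc_term_matrix.shape)  # (3, 3): 3 docs x 3 terms
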
gradiobee/insert_spaces.py
ADDED
@@ -0,0 +1,14 @@
+"""Insert spaces, mypython/split_chinese.py."""
+import re
+
+
+def insert_spaces(text: str) -> str:
+    """Insert space in Chinese characters.
+
+    >>> insert_spaces("test亨利it四世上")
+    ' test 亨 利 it 四 世 上 '
+    >>> insert_spaces("test亨利it四世上").strip().__len__()
+    17
+
+    """
+    return re.sub(r"(?<=[a-zA-Z\d]) (?=[a-zA-Z\d])", "", text.replace("", " "))
gradiobee/mdx_dict_e2c.lzma
ADDED
Binary file (1.18 MB)

gradiobee/mdx_e2c.py
ADDED
@@ -0,0 +1,40 @@
+"""Load mdx_dict_e2c c2e.
+
+mdx_e2c = joblib.load("./mdx_dict_e2c.lzma")
+mdx_c2e = joblib.load("./mdx_dict_e2c.lzma")
+"""
+from pathlib import Path
+from string import punctuation
+import joblib
+
+# keep "-"
+punctuation = punctuation.replace("-", "")
+c_dir = Path(__file__).parent
+
+# lazy load in __init__.py like this?
+# mdx_dict_e2c = importlib.import_module("mdx_dict_e2c")
+# mdx_e2c = mdx_dict_e2c.mdx_e2c
+# mdx_dict_c2e = importlib.import_module("mdx_dict_c2e")
+# mdx_c2e = mdx_dict_c2e.mdx_c2e
+
+mdx_dict_e2c = joblib.load(c_dir / "mdx_dict_e2c.lzma")
+print("e2c lzma file loaded")
+
+# memory = joblib.Memory("joblibcache", verbose=0)
+
+
+# @memory.cache  # no need, mdx_dict_e2c in RAM
+def mdx_e2c(word: str) -> str:
+    """Fetch definition for word.
+
+    Args:
+        word: word to look up
+    Returns:
+        definition entry or the word itself
+    >>> mdx_e2c("do").__len__()
+    43
+    >>> mdx_e2c("我").strip()
+    '我'
+    """
+    word = word.strip(punctuation + " \t\n\r")
+    return mdx_dict_e2c.get(word.lower(), word)
gradiobee/plot_df.py
ADDED
@@ -0,0 +1,98 @@
+"""Plot pandas.DataFrame with DBSCAN clustering."""
+# pylint: disable=invalid-name, too-many-arguments
+from typing import Optional
+
+# import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.cluster import DBSCAN
+
+from logzero import logger
+
+# turn interactive when in an ipython session
+if "get_ipython" in globals():
+    plt.ion()
+
+
+# fmt: off
+def plot_df(
+        df_: pd.DataFrame,
+        min_samples: int = 6,
+        eps: float = 10,
+        ylim: Optional[int] = None,
+        xlabel: str = "en",
+        ylabel: str = "zh",
+) -> plt:
+    # fmt: on
+    """Plot df with DBSCAN clustering.
+
+    Args:
+        df_: pandas.DataFrame with three columns: ["x", "y", "cos"]
+    Returns:
+        matplotlib.pyplot: for possible use in gradio
+
+    plot_df(pd.DataFrame(cmat2tset(smat), columns=['x', 'y', 'cos']))
+    df_ = pd.DataFrame(cmat2tset(smat), columns=['x', 'y', 'cos'])
+
+    # sort 'x', axis 0 changes, index regenerated
+    df_s = df_.sort_values('x', axis=0, ignore_index=True)
+
+    # sorting does not seem to impact clustering
+    DBSCAN(1.5, min_samples=3).fit(df_).labels_
+    DBSCAN(1.5, min_samples=3).fit(df_s).labels_
+
+    """
+    df_ = pd.DataFrame(df_)
+    if df_.columns.__len__() < 3:
+        logger.error(
+            "expected a DataFrame with 3 columns, got: %s, can't proceed, returning None",
+            df_.columns.tolist(),
+        )
+        return None
+
+    # take the first three columns
+    columns = df_.columns[:3]
+    df_ = df_[columns]
+
+    # rename columns to "x", "y", "cos"
+    df_.columns = ["x", "y", "cos"]
+
+    sns.set()
+    sns.set_style("darkgrid")
+    fig, (ax0, ax1) = plt.subplots(2, figsize=(11.69, 8.27))
+    fig.suptitle("alignment projection")
+    _ = DBSCAN(min_samples=min_samples, eps=eps).fit(df_).labels_ > -1
+    _x = DBSCAN(min_samples=min_samples, eps=eps).fit(df_).labels_ < 0
+
+    # ax0.scatter(df_[_].x, df_[_].y, marker='o', c='g', alpha=0.5)
+    # ax0.grid()
+    # print("ratio: %.2f%%" % (100 * sum(_) / len(df_)))
+
+    df_.plot.scatter("x", "y", c="cos", cmap="viridis_r", ax=ax0)
+
+    # clustered
+    df_[_].plot.scatter("x", "y", c="cos", cmap="viridis_r", ax=ax1)
+
+    # outliers
+    df_[_x].plot.scatter("x", "y", c="r", marker="x", alpha=0.6, ax=ax0)
+
+    ax0.set_xlabel("")
+    ax0.set_ylabel(ylabel)
+    xlim = len(df_)
+    ax0.set_xlim(0, xlim)
+    if ylim:
+        ax0.set_ylim(0, ylim)
+    ax0.set_title("max similarity along columns (outliers denoted by 'x')")
+
+    ax1.set_xlabel(xlabel)
+    ax1.set_ylabel(ylabel)
+
+    ax1.set_xlim(0, xlim)
+    if ylim:
+        ax1.set_ylim(0, ylim)
+    ax1.set_title(f"potential aligned pairs ({round(sum(_) / len(df_), 2):.0%})")
+
+    return plt
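
An end-to-end sketch tying the new pieces together, with a random matrix standing in for a real similarity matrix:

    import numpy as np
    import pandas as pd
    from gradiobee.cmat2tset import cmat2tset
    from gradiobee.plot_df import plot_df

    smat = np.random.rand(50, 60)  # stand-in for smatrix(doc1, doc2)
    df = pd.DataFrame(cmat2tset(smat), columns=["x", "y", "cos"])
    plt = plot_df(df, min_samples=6, eps=10)
    if plt is not None:
        plt.savefig("alignment.png")
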
gradiobee/smatrix.py
ADDED
@@ -0,0 +1,100 @@
+"""Generate a similarity matrix (doc-term score matrix) based on textacy.representations.Vectorizer.
+
+refer also to fast-scores fast_scores.py and gen_model.py (sklearn.feature_extraction.text.TfidfVectorizer).
+originally docterm_scores.py.
+"""
+from typing import Dict, Iterable, List, Optional, Union
+import numpy as np
+from itertools import chain
+from psutil import virtual_memory
+from more_itertools import ilen
+
+from textacy.representations import Vectorizer
+# from textacy.representations.vectorizers import Vectorizer
+from logzero import logger
+
+# from smatrix.gen_model import gen_model
+from gradiobee.gen_model import gen_model
+
+
+# fmt: off
+def smatrix(
+        doc1: Iterable[Iterable[str]],  # List[List[str]],
+        doc2: Iterable[Iterable[str]],
+        model: Vectorizer = None,
+        tf_type: str = 'linear',
+        idf_type: Optional[str] = "smooth",
+        # dl_type: Optional[str] = "sqrt",  # "lucene-style tfidf"
+        dl_type: Optional[str] = None,
+        norm: Optional[str] = "l2",  # + "l2"
+        min_df: Union[int, float] = 1,
+        max_df: Union[int, float] = 1.0,
+        max_n_terms: Optional[int] = None,
+        vocabulary_terms: Optional[Union[Dict[str, int], Iterable[str]]] = None
+) -> np.ndarray:
+    # fmt: on
+    """Generate a doc-term score matrix based on textacy.representations.Vectorizer.
+
+    Args
+        doc1: tokenized doc of n1
+        doc2: tokenized doc of n2
+        model: if None, generate one ad hoc from doc1 and doc2 ("lucene-style tfidf").
+        rest: refer to textacy.representations.Vectorizer
+    Attributes
+        model: the fitted vectorizer
+
+    Returns
+        n2 x n1 similarity matrix of float numbers (dt2 . dt1.T)
+    """
+    # make sure doc1/doc2 is of the right typing
+    try:
+        for xelm in iter(doc1):
+            for elm in iter(xelm):
+                assert isinstance(elm, str)
+    except AssertionError:
+        raise AssertionError(" doc1 is not of the typing Iterable[Iterable[str]] ")
+    except Exception as e:
+        logger.error(e)
+        raise
+    try:
+        for xelm in iter(doc2):
+            for elm in iter(xelm):
+                assert isinstance(elm, str)
+    except AssertionError:
+        raise AssertionError(" doc2 is not of the typing Iterable[Iterable[str]] ")
+    except Exception as e:
+        logger.error(e)
+        raise
+
+    if model is None:
+        model = gen_model(
+            [*chain(doc1, doc2)],
+            tf_type=tf_type,
+            idf_type=idf_type,
+            dl_type=dl_type,
+            norm=norm,
+            min_df=min_df,
+            max_df=max_df,
+            max_n_terms=max_n_terms,
+            vocabulary_terms=vocabulary_terms
+        )
+    # docterm_scores.model = model
+    smatrix.model = model
+
+    # a1 = dt.toarray(), a2 = doc_term_matrix.toarray()
+    # np.all(np.isclose(a1, a2))
+
+    dt1 = model.transform(doc1)
+    dt2 = model.transform(doc2)
+
+    # each matrix entry is a 64-bit float: 8 bytes
+    require_ram = ilen(iter(doc1)) * ilen(iter(doc2)) * 8
+    if require_ram > virtual_memory().available:
+        logger.warning("virtual_memory().available: %s", virtual_memory().available)
+        logger.warning("memory required: %s", require_ram)
+
+    if require_ram > virtual_memory().available * 10:
+        logger.warning("You're likely to encounter memory problems, such as slowed response and/or OOM.")
+
+    # return dt1.dot(dt2.T)
+    return dt2.toarray().dot(dt1.toarray().T)
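
Since `smatrix` stashes the fitted vectorizer on the function object, a second call can reuse it instead of refitting (sketch, same made-up docs as in the `docterm_scores` note above):

    from gradiobee.smatrix import smatrix

    doc1 = [["i", "love", "cats"], ["dogs", "bark"]]
    doc2 = [["cats", "purr"], ["dogs", "bark"], ["birds"]]
    smat = smatrix(doc1, doc2)                        # fits ad hoc, keeps the model
    smat2 = smatrix(doc1, doc2, model=smatrix.model)  # reuse the fitted vectorizer
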
pyrightconfig.json
ADDED
@@ -0,0 +1,11 @@
+{
+    "include": ["tests", "gradiobee"],
+    "venvPath": ".venv/Scripts",
+    "reportTypeshedErrors": false,
+    "reportMissingImports": true,
+    "reportMissingTypeStubs": false,
+
+    "pythonVersion": "3.7",
+
+    "ignore": []
+}

requirements.txt
CHANGED
@@ -2,4 +2,10 @@ chardet
 certifi
 charset-normalizer
 idna
-typing-extensions
+typing-extensions
+sklearn
+textacy
+logzero
+more_itertools
+psutil
+seaborn