Commit 686cd54 — "Update app"
Author: a-v-bely
Parent: cd5e287

(In the per-file diffs below, "…" marks line content that the diff viewer did not render.)

Files changed:
- pages/2_👨🏫_Начало_работы.py +62 -94
- pages/3_📥_Скачать.py +7 -2
- pages/4_📝_Онлайн-тест (эксперимент).py +14 -13
- requirements.txt +16 -14
- utilities_cookies/cookie_manager.py +2 -4
- utilities_cookies/encrypted_cookie_manager.py +1 -3
- utilities_database/user_database_utils.py +7 -7
- utilities_database/user_database_widgets.py +3 -4
- utilities_language_bert/rus_main_workflow_bert.py +70 -72
- utilities_language_bert/rus_sentence_bert.py +18 -41
- utilities_language_general/rus_constants.py +71 -15
- utilities_language_general/rus_utils.py +98 -17
- utilities_language_general/similarity_measures.py +255 -0
- utilities_language_w2v/rus_main_workflow_w2v.py +71 -74
- utilities_language_w2v/rus_sentence_w2v.py +26 -43
pages/2_👨🏫_Начало_работы.py
CHANGED

@@ -4,7 +4,6 @@ from utilities_database.user_database_utils import load_user_tasks_data
 from utilities_database.user_database_utils import save_data_in_database
 from utilities_database.user_database_widgets import user_save_text_table
 from utilities_database.user_database_utils import load_users_particular_task
-from streamlit_extras.no_default_selectbox import selectbox as custom_select_box
 
 # Interface
 if st.session_state.get('-LOGGED_IN_BOOL-'):
@@ -15,52 +14,38 @@ if st.session_state.get('-LOGGED_IN_BOOL-'):
 '\n\n**_:red[СОЗДАНИЕ ЗАДАНИЙ]_**'
 '\n\nПосле выбора данного режима работы появится форма, которую необходимо заполнить:'
 '\n\n1. Придумайте **название** для файла с заданиями. '
-'Вы можете оставить это поле пустым - именем по умолчанию служит текущая дата и первые 20 символов '
-'введенного Вами текста.'
-'\n\n2. Введите **текст** или выберите **текстовый файл** с исходным текстом, на основе которого Вы хотите '
-'создать задания. '
+'Вы можете оставить это поле пустым - именем по умолчанию служит текущая дата и первые 20 символов'
+' введенного Вами текста.'
+'\n\n2. Введите **текст** или выберите **текстовый файл** с исходным текстом, на основе которого Вы хотите'
+' создать задания. '
 '\n\n3. Укажите *способ выбора целевых слов*:'
 '\n\t* *:green[Автоматически]*: программа сама выберет подходящие по сложности целевые слова.'
-'\n\t* *:blue[Самостоятельно]*: введите в соответствующее поле целевые слова через запятую в той форме, '
-'в которой они встречаются в тексте. В этом случае *:orange[языковой уровень]* можно не указывать, но тогда '
-'дистракторы будут полностью случайными и несоотнесёнными с уровнем.'
-'\n4. Если Вы выбрали *:green[автоматический поиск целевых слов]*, **_:red[обязательно]_** укажите '
-'*:orange[языковой уровень]*. Данный параметр отвечает за выбор лексического минимума, использующегося при '
-'подборе дистракторов.'
-'\n5. Если Вы выбрали *:blue[самостоятельный ввод целевых слов]*, проверьте, что заполнили соответствующее '
-'поле. ️ ❗ **:red[Введите слова в той форме, в которой они встречаются в тексте]**.'
-'\n6. Укажите число дистракторов - неправильных вариантов ответа. Если указано _более четырех_ '
-'дистракторов, возможно, что в некоторых заданиях будет выведено _меньшее количество, но не менее четырех_ '
-'вариантов. Данное обстоятельство связано с проверкой наличия дистракторов в лексических минимумах.'
+'\n\t* *:blue[Самостоятельно]*: введите в соответствующее поле целевые слова через запятую в той форме,'
+' в которой они встречаются в тексте. В этом случае *:orange[языковой уровень]* можно не указывать, но тогда'
+' дистракторы будут полностью случайными и несоотнесёнными с уровнем.'
+'\n4. Если Вы выбрали *:green[автоматический поиск целевых слов]*, **_:red[обязательно]_** укажите'
+' *:orange[языковой уровень]*. Данный параметр отвечает за выбор лексического минимума, использующегося при'
+' подборе дистракторов.'
+'\n5. Если Вы выбрали *:blue[самостоятельный ввод целевых слов]*, проверьте, что заполнили соответствующее'
+' поле. ️ ❗ **:red[Введите слова в той форме, в которой они встречаются в тексте]**.'
+'\n6. Укажите число дистракторов - неправильных вариантов ответа. Если указано _более четырех_'
+' дистракторов, возможно, что в некоторых заданиях будет выведено _меньшее количество, но не менее четырех_'
+' вариантов. Данное обстоятельство связано с проверкой наличия дистракторов в лексических минимумах.'
 '\n7. Выберите **способы вывода** готовых материалов.'
-'\n8. Для начала работы нажмите на кнопку **"Запуск"**. Если все поля заполнены верно, '
-'начнется процесс генерации заданий. Прогресс будет отображаться на экране.'
+'\n8. Для начала работы нажмите на кнопку **"Запуск"**. Если все поля заполнены верно,'
+' начнется процесс генерации заданий. Прогресс будет отображаться на экране.'
 '\n9. По окончании процесса генерации заданий будет выведено **_:green[соответсвующее сообщение]_**. '
 'Затем Вы можете перейти на вкладки **просмотра и 📥 сохранения** заданий, а так же 📝**онлайн-теста**.'
 '\n\n**_:red[ЗАГРУЗКА ИЗ АРХИВА]_**'
-'\n\nПосле выбора данного режима работы появится таблица, в которой перечислены названия заданий, '
-'которые Вы сохранили, языковой уровень и дата их создания.'
+'\n\nПосле выбора данного режима работы появится таблица, в которой перечислены названия заданий,'
+' которые Вы сохранили, языковой уровень и дата их создания.'
 ' Для загрузки определенного файла с заданиями:'
 '\n1. Введите (или скопируйте из таблицы) название.'
 '\n2. Укажите соответсвующий языковой уровень.'
 '\n3. Нажмите на кнопку **"Загрузить"**.'
 '\n4. Если все поля заполнены верно, Вы увидите сообщение о том, что **:green[задания успешно загружены]**.'
-'\n\n\nДля того, чтобы свернуть/развернуть блоки **Инструкций** или **Важной информации**, '
-'кликните по заголовку этого блока или по стрелке (ᐯ / ᐱ), располагающейся в его правом верхнем углу.')
+'\n\n\nДля того, чтобы свернуть/развернуть блоки **Инструкций** или **Важной информации**,'
+' кликните по заголовку этого блока или по стрелке (ᐯ / ᐱ), располагающейся в его правом верхнем углу.')
-ANNOUNCES = st.expander('**ВАЖНАЯ ИНФОРМАЦИЯ**', expanded=False)
-ANNOUNCES.success(
-'**Уважаемые пользователи, пожалуйста, после генерации заданий перейдите на вкладку "📝Онлайн-тест" '
-'и заполните там опросник. Таким образом Вы очень поможете в улучшении качества заданий! Спасибо!🤗**')
-ANNOUNCES.warning(
-'**Сейчас генератор проходит завершающую настройку и отладку, для которой необходимо большое количество '
-'данных об уместности выбранных целевых слов и дистракторов к ним. Поэтому просим Вас отнестись с пониманием '
-'к излишне большому количеству заданий.**')
-ANNOUNCES.warning(
-'**❗️ㅤУбедительно просим Вас дожидаться окончания генерации или загрузки и не переходить на '
-'другие вкладки до выведения соответствующего сообщения.**')
-ANNOUNCES.warning(
-'**❗ㅤВ случае появления красных сообщений об ошибке, как правило, проблема решается '
-'повторными нажатиями на нужный Вам элемент. Приносим извинения за неудобства.**')
 WHAT_TO_DO = st.radio(
 label='**Выберите режим работы**',
 options=[
@@ -79,10 +64,11 @@ if st.session_state.get('-LOGGED_IN_BOOL-'):
 COL1, COL2 = LOAD_FORM.columns([1, 1])
 UPLOAD_CLOUD_FILE_NAME = COL1.text_input('Введите название заданий', placeholder='Жду название')
 with COL2:
-UPLOAD_CLOUD_CEFR_LEVEL = custom_select_box(
-'Выберите языковой уровень',
-['A1', 'A2', 'B1', 'B2', 'Без уровня'],
-…
+UPLOAD_CLOUD_CEFR_LEVEL = st.selectbox(
+label='Выберите языковой уровень',
+options=['A1', 'A2', 'B1', 'B2', 'C1', 'Без уровня'],
+index=None,
+placeholder='-Выберите языковой уровень-')
 st.session_state['-UPLOAD_CLOUD_CEFR_LEVEL-'] = UPLOAD_CLOUD_CEFR_LEVEL
 LOAD_BUTTON = LOAD_FORM.form_submit_button('Загрузить')
 if LOAD_BUTTON:
@@ -132,22 +118,23 @@ if st.session_state.get('-LOGGED_IN_BOOL-'):
 key='-DISTRACTOR_MODEL_MODE-', horizontal=True)
 CEFR_NUM_DISTRACTORS_COL, UTW_COL = FORM.columns([2, 2])
 with CEFR_NUM_DISTRACTORS_COL:
-CEFR_TEXT_LEVEL = custom_select_box(
-'Выберите языковой уровень',
-['A1', 'A2', 'B1', 'B2', 'Без уровня'],
-…
+CEFR_TEXT_LEVEL = st.selectbox(
+label='Выберите языковой уровень',
+options=['A1', 'A2', 'B1', 'B2', 'Без уровня'],
+index=None,
+placeholder='-Выберите языковой уровень-')
 st.session_state['-CEFR_TEXT_LEVEL-'] = CEFR_TEXT_LEVEL
 NUMBER_DISTRACTORS = CEFR_NUM_DISTRACTORS_COL.number_input(
 label='**Выберите количество дистракторов в задании:**',
 min_value=1,
 max_value=9,
-value=…
+value=3,
 key='-NUM_DISTRACTORS-')
 TARGET_WORDS = UTW_COL.text_area(
 label='**Если "Самостоятельно", введите целевые слова:**',
 value='',
 height=120,
-placeholder='Через запятую…
+placeholder='Через запятую',
 key='-INPUT_TARGET_WORDS-')
 FORM.markdown('**Выберите формат(-ы) вывода:**')
 col1, col2, col3 = FORM.columns(3)
@@ -166,47 +153,44 @@ if st.session_state.get('-LOGGED_IN_BOOL-'):
 
 START_COL, RERUN_COL, EXIT_COL = FORM.columns([1, 1, 1])
 START_BUTTON = START_COL.form_submit_button(
-label='**Запуск**')
+label='**Запуск**',
+use_container_width=True)
 RERUN_BUTTON = RERUN_COL.form_submit_button(
-label='**Перезагрузка**')
+label='**Перезагрузка**',
+use_container_width=True)
 EXIT_BUTTON = EXIT_COL.form_submit_button(
-label='**Выход**')
+label='**Выход**',
+use_container_width=True)
 
 if START_BUTTON:
-LOGS = st.empty()
-LOGS.success('**Готовые задания или онлайн-тест появятся на отдельных страницах**')
-
 # Initiate interface structure
-… (3 lines not rendered)
+LOGS = st.status(label='Прогресс выполнения', expanded=True)
+
+PROGRESS_BAR = LOGS.progress(0)
+PROGRESS_BAR_DISTRACTORS = LOGS.progress(0)
 
 # Start generation process. Everything happens inside main_workflow func
 if DISTRACTOR_MODEL == 'Модель-3':
-from utilities_language_bert.rus_main_workflow_bert import main_workflow_bert
-__TASK_DATA__ = main_workflow_bert(
+from utilities_language_bert.rus_main_workflow_bert import main_workflow
+__TASK_DATA__ = main_workflow(
 file=UPLOAD_FILE,
 text=UPLOAD_TEXT,
 logs=LOGS,
-logs_d=LOGS_D,
 progress=PROGRESS_BAR,
-…
+progress_d=PROGRESS_BAR_DISTRACTORS,
 level=CEFR_TEXT_LEVEL,
 tw_mode_automatic_mode=TARGET_WORDS_MODE,
 target_words=TARGET_WORDS,
 num_distractors=NUMBER_DISTRACTORS,
 save_name=USER__SAVE_IN_CLOUD_FILE_NAME)
 else:
-… (2 lines not rendered)
-__TASK_DATA__ = main_workflow_w2v(
+from utilities_language_w2v.rus_main_workflow_w2v import main_workflow
+__TASK_DATA__ = main_workflow(
 file=UPLOAD_FILE,
 text=UPLOAD_TEXT,
 logs=LOGS,
-logs_d=LOGS_D,
 progress=PROGRESS_BAR,
-progress_d=…
-progress_s=PROGRESS_BAR_S,
+progress_d=PROGRESS_BAR_DISTRACTORS,
 level=CEFR_TEXT_LEVEL,
 tw_mode_automatic_mode=TARGET_WORDS_MODE,
 target_words=TARGET_WORDS,
@@ -225,28 +209,18 @@ if st.session_state.get('-LOGGED_IN_BOOL-'):
 st.session_state['-UPLOAD_CLOUD_FILE_NAME-'] = USER__SAVE_IN_CLOUD_FILE_NAME
 
 PROGRESS_BAR.progress(100)
-… (12 lines not rendered)
-'соответсвующую вкладку**')
-else:
-LOGS.success('**Все готово! Готовые задания и/или онлайн-тест доступны в соответствующих вкладках.**')
-
-# if TARGET_WORDS_MODE == 'Самостоятельно':
-#     st.error('По разным причинам не смогли не смогли придумать задания со словами: ' +
-#              ', '.join(__TASK_DATA__['BAD_USER_TARGET_WORDS']))
-PROGRESS_BAR = ''
-PROGRESS_BAR_S = ''
-LOGS_D = ''
+PROGRESS_BAR_DISTRACTORS.progress(100)
+LOGS.update(label='**Все готово! Готовые задания и/или онлайн-тест доступны в соответствующих вкладках.**',
+state='complete', expanded=False)
+save_data_in_database(
+user_task_database=user_save_text_table,
+save_type='download',
+save_name=USER__SAVE_IN_CLOUD_FILE_NAME,
+cefr_level=CEFR_TEXT_LEVEL,
+created_at=str(datetime.datetime.now())[:-7],
+creator_name=st.session_state.get('-USER_NAME-'),
+generated_result=__TASK_DATA__,
+distractor_model=DISTRACTOR_MODEL, allow=SAVE_IN_CLOUD)
 
 if EXIT_BUTTON:
 for key in st.session_state:
@@ -260,12 +234,6 @@ if st.session_state.get('-LOGGED_IN_BOOL-'):
 st.error('Что-то пошло не так?! Перезагружаюсь!')
 st.session_state["START_GENERATION"] = False
 st.stop()
-st.…
+st.rerun()
-
-# LABEL
-# st.markdown('*Автор-разработчик: А.В.Белый, кафедра математической лингвистики, филологический факультет СПбГУ,'
-#             ' 4 курс, бакалавриат, "Прикладная, компьютерная и математическая лингвистика (английский язык)"*'
-#             '\n\n*Научный руководитель: канд. филол. наук, доц. О.А.Митрофанова*')
-# st.markdown('*E-mail: [email protected]*')
 else:
 st.warning('**Войдите или зарегистрируйтесь**')
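Note: the main UI change in this file is a move from ad-hoc widgets to Streamlit built-ins. A minimal sketch of the two patterns the commit adopts — illustrative code, not the app's own, assuming a Streamlit version with st.status and selectbox(index=None) (the pinned streamlit==1.32.2 has both):

import streamlit as st

# Selectbox with no preselected value: index=None plus placeholder replaces
# streamlit_extras.no_default_selectbox; it returns None until the user picks.
level = st.selectbox(label='Выберите языковой уровень',
                     options=['A1', 'A2', 'B1', 'B2', 'Без уровня'],
                     index=None,
                     placeholder='-Выберите языковой уровень-')

# Collapsible status block that owns its progress bars: replaces the old
# st.empty() placeholder plus standalone st.progress widgets.
logs = st.status(label='Прогресс выполнения', expanded=True)
bar = logs.progress(0)
for pct in (10, 50, 100):   # illustrative stages
    bar.progress(pct)
logs.update(label='Все готово!', state='complete', expanded=False)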
pages/3_📥_Скачать.py
CHANGED

@@ -1,7 +1,7 @@
 import streamlit as st
 from utilities_ui.custom_download_button import download_button as d_button
 
-st.set_page_config(page_title='Скачать', layout="wide", page_icon=':ru:')
+st.set_page_config(page_title='Скачать', layout="wide", page_icon=':ru:', initial_sidebar_state='collapsed')
 if st.session_state.get('-LOGGED_IN_BOOL-') and (st.session_state.get('-DISPLAY_READY-')
 or st.session_state.get('-DOWNLOAD_VERSION-')):
 result = st.session_state.get('RESULT')
@@ -10,7 +10,7 @@ if st.session_state.get('-LOGGED_IN_BOOL-') and (st.session_state.get('-DISPLAY_
 st.stop()
 # Download buttons
 if st.session_state.get('-DOWNLOAD_VERSION-'):
-invite, tasks_col, tasks_with_answers_col, full_coll, rest = st.columns([1, 1, 2, 3, 1])
+invite, tasks_col, tasks_with_answers_col, keys_only_col, full_coll, rest = st.columns([1, 1, 2, 1, 3, 1])
 invite.write('Скачать:')
 with tasks_col:
 d_button(
@@ -22,6 +22,11 @@ if st.session_state.get('-LOGGED_IN_BOOL-') and (st.session_state.get('-DISPLAY_
 label='Задания+Ключи',
 data=result['TEACHER_OUT'],
 file_name=f'{result["name"]}_tasks_and_keys.txt')
+with keys_only_col:
+d_button(
+label='Ключи',
+data=result['KEYS_ONLY'],
+file_name=f'{result["name"]}_keys.txt')
 with full_coll:
 d_button(
 label='Исходник+Задания+Ключи',
pages/4_📝_Онлайн-тест (эксперимент).py
CHANGED

@@ -4,19 +4,19 @@ import streamlit as st
 from utilities_database.user_database_utils import save_data_in_database
 from utilities_database.user_database_widgets import user_save_text_table
 
-st.set_page_config(page_title='Онлайн-тест', layout="wide", page_icon=':ru:')
+st.set_page_config(page_title='Онлайн-тест', layout="wide", page_icon=':ru:', initial_sidebar_state='collapsed')
 if st.session_state.get('-ONLINE_TEST_READY-') and st.session_state.get('-LOGGED_IN_BOOL-'):
-INSTRUCTION = st.expander(label='**ИНСТРУКЦИЯ**', expanded=True)
+INSTRUCTION = st.expander(label='**ИНСТРУКЦИЯ**', expanded=False)
 INSTRUCTION.markdown(
 'Уважаемые пользователи, предлагаем Вам заполнить опросник по оценке качества созданных заданий. '
 '\n\nНиже находится анкета с заданиями в таблице.'
 '\n\n- В **первом столбце** приводится ответ - слово, удаленное из оригинального текста.'
 '\n\n- Отметьте во **втором столбце**, уместно ли создавать задание с данным словом.'
 '\n\n- В **третьем столбце** приведены подобранные программой дистракторы.'
-'\n\n- Введите в **четвертый столбец** дистракторы (целиком или букву), которые, по Вашему мнению, '
-'**:red[не уместны]**. '
-'\n\n**:green[Уместными дистракторами]** мы предлагаем считать те, которые одновременно удовлетворяют '
-'следующим условиям в рамках языкового уровня, для которого они созданы:'
+'\n\n- Введите в **четвертый столбец** дистракторы (целиком или букву), которые, по Вашему мнению,'
+' **:red[не уместны]**. '
+'\n\n**:green[Уместными дистракторами]** мы предлагаем считать те, которые одновременно удовлетворяют'
+' следующим условиям в рамках языкового уровня, для которого они созданы:'
 '\n\n1. не слишком очевидно являются неправильными вариантами (*варить суп/стол*);'
 '\n\n2. и при этом не могут быть полноценной заменой удаленного слова (*варить суп/кашу*)'
 )
@@ -31,25 +31,26 @@ if st.session_state.get('-ONLINE_TEST_READY-') and st.session_state.get('-LOGGED
 ONLINE_TEST = st.form('Онлайн тест')
 ONLINE_TEST.write(result['TEXT_WITH_GAPS'].replace('_', '\_'))
 BAD_DISTRACTORS_AND_ANSWERS_temp = ONLINE_TEST.data_editor(
-pd.DataFrame([{"Задание №": i+1,
+pd.DataFrame([{"Задание №": i + 1,
 "Ответ": [answers[i][1]],
 "Задание уместно": False,
 "Дистракторы": tasks[i][1],
 "Неуместные дистракторы": ''}
 for i in range(len(tasks))]),
 num_rows="fixed",
-height=…
+height=40 * len_answers,
+hide_index=True,
 use_container_width=True)
-COMMENTS = ONLINE_TEST.…
-…
-SUBMIT = ONLINE_TEST.form_submit_button('…
+COMMENTS = ONLINE_TEST.text_area(label='**Прокомментировать**',
+placeholder='Напишите комментарий')
+SUBMIT = ONLINE_TEST.form_submit_button('ГОТОВО')
 if SUBMIT:
 points = test_mark = 'Teacher'
 appropriate_tasks = BAD_DISTRACTORS_AND_ANSWERS_temp["Задание уместно"].values.tolist()
 inappropriate_distractors = BAD_DISTRACTORS_AND_ANSWERS_temp["Неуместные дистракторы"].values.tolist()
 RETURN_TEST_DATA = [{'ANSWER': answers[i],
 'APPROPRIATE_TASK': appropriate_tasks[i],
-…
+'INAPPROPRIATE_DISTRACTORS': inappropriate_distractors[i]} for i in range(len_answers)]
 save_data_in_database(user_task_database=user_save_text_table,
 save_type='online_test',
 save_name=st.session_state['-UPLOAD_CLOUD_FILE_NAME-'],
@@ -57,8 +58,8 @@ if st.session_state.get('-ONLINE_TEST_READY-') and st.session_state.get('-LOGGED
 created_at=str(datetime.datetime.now())[:-7],
 creator_name=st.session_state.get('-USER_NAME-'),
 test_taker_name=st.session_state.get('-USER_NAME-'),
-test_taker_answers=RETURN_TEST_DATA,
 generated_result=result,
+test_taker_answers=RETURN_TEST_DATA,
 test_taker_result={'Баллов': points, 'Всего': len_answers, 'Оценка': test_mark},
 comments=COMMENTS)
 elif st.session_state.get('-LOGGED_IN_BOOL-'):
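Note: the hard-coded editor height is replaced by a per-row rule. A worked example of that rule (the task count is illustrative; the ~40 px row height is an assumption about how data_editor renders):

# Assumption: one data_editor row takes roughly 40 px, so sizing by row
# count shows the whole test without an inner scrollbar.
len_answers = 12             # hypothetical number of tasks in the test
height = 40 * len_answers    # -> 480 px for a 12-item test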
requirements.txt
CHANGED

@@ -1,15 +1,17 @@
-… (14 lines not rendered)
+nltk==3.8.1
+spacy==3.7.2
+torch==2.1.0
+gensim==4.3.2
+pandas==2.2.0
+catboost==1.2.2
+supabase==2.4.0
+requests==2.31.0
+pymorphy2==0.9.1
+trycourier==5.0.0
+streamlit==1.32.2
+argon2-cffi==21.1.0
+cryptography==42.0.3
+transformers==4.38.2
+streamlit-extras==0.4.0
+bert-extractive-summarizer==0.10.1
 ru_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/ru_core_news_lg-3.7.0/ru_core_news_lg-3.7.0-py3-none-any.whl
utilities_cookies/cookie_manager.py
CHANGED

@@ -1,10 +1,8 @@
 import streamlit as st
 from pathlib import Path
-from typing import Mapping
-from datetime import datetime
-from datetime import timedelta
 from urllib.parse import unquote
-from typing import MutableMapping
+from datetime import datetime, timedelta
+from typing import Mapping, MutableMapping
 from streamlit.components.v1 import components
 
 
utilities_cookies/encrypted_cookie_manager.py
CHANGED

@@ -1,12 +1,10 @@
 import os
 import base64
 import streamlit as st
-from typing import Tuple
-from typing import Optional
 from cryptography import fernet
-from typing import MutableMapping
 from cryptography.fernet import Fernet
 from cryptography.hazmat.primitives import hashes
+from typing import Tuple, Optional, MutableMapping
 from utilities_cookies.cookie_manager import CookieManager
 from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
 
utilities_database/user_database_utils.py
CHANGED

@@ -1,9 +1,9 @@
-import re
-import json
-import secrets
 import pandas as pd
 import streamlit as st
+from json import loads
+from re import search, compile
 from trycourier import Courier
+from secrets import token_urlsafe
 from argon2 import PasswordHasher
 from argon2.exceptions import VerifyMismatchError
 
@@ -37,7 +37,7 @@ def check_valid_name(name_sign_up: str) -> bool:
 name_regex_eng = r'^[A-Za-z_]\w *'
 name_regex_rus = r'^[А-Яа-я_][А-Яа-я0-9_] *'
 
-if re.search(name_regex_eng, name_sign_up) or re.search(name_regex_rus, name_sign_up):
+if search(name_regex_eng, name_sign_up) or search(name_regex_rus, name_sign_up):
 return True
 return False
 
@@ -46,7 +46,7 @@ def check_valid_email(email_sign_up: str) -> bool:
 """
 Checks if the user entered a valid email while creating the account.
 """
-regex = …
+regex = compile(r'([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+')
 return True
 
 # if re.fullmatch(regex, email_sign_up):
@@ -133,7 +133,7 @@ def generate_random_passwd() -> str:
 Generates a random password to be sent in email.
 """
 password_length = 10
-return secrets.token_urlsafe(password_length)
+return token_urlsafe(password_length)
 
 
 def send_passwd_in_email(auth_token: str, user_name_forgot_passwd: str, email_forgot_passwd: str, company_name: str,
@@ -266,5 +266,5 @@ def load_users_particular_task(user_task_database, load_mode, creator_name, save
 .eq('save_name', save_name)\
 .eq('save_type', load_mode)\
 .eq('cefr_level',cefr_level).execute().data[0]['generated_result']
-return_data = json.loads(return_data.replace("'", '"'), strict=False)
+return_data = loads(return_data.replace("'", '"'), strict=False)
 return return_data
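Note: the loads(return_data.replace("'", '"'), strict=False) round-trip assumes the stored generated_result is a repr()-style dict whose values contain no quotes; a single apostrophe breaks the quote swap. A sketch of the failure mode and a stdlib alternative for strings that are Python literals (the stored value below is hypothetical):

import ast
from json import loads

stored = "{'name': \"д'Артаньян\"}"   # hypothetical stored value with an apostrophe
# loads(stored.replace("'", '"'))    # would raise JSONDecodeError: the inner quote breaks the swap
data = ast.literal_eval(stored)      # parses the Python-literal string safely
print(data['name'])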
utilities_database/user_database_widgets.py
CHANGED

@@ -3,7 +3,6 @@ from datetime import datetime
 from supabase import create_client, Client
 from utilities_option_menu.option_menu import option_menu
 import utilities_database.user_database_utils as db_utils
-from utilities_database.user_database_utils import check_usr_pass
 from utilities_cookies.encrypted_cookie_manager import EncryptedCookieManager
 
 DB_URL = st.secrets['SUPABASE_URL']
@@ -91,9 +90,9 @@ class LogIn:
 login_submit_button = st.form_submit_button(label='Войти')
 
 if login_submit_button:
-authenticate_user_check = check_usr_pass(user_log_in_database=user_login_table,
-user_name=user_name,
-password=password)
+authenticate_user_check = db_utils.check_usr_pass(user_log_in_database=user_login_table,
+user_name=user_name,
+password=password)
 
 if not authenticate_user_check:
 st.error("Неверное имя пользователя или пароль!")
utilities_language_bert/rus_main_workflow_bert.py
CHANGED

@@ -1,28 +1,20 @@
 import datetime
 from io import StringIO
+from typing import Union
 from random import sample
 from collections import defaultdict
-from streamlit import progress as st_progress
-from utilities_language_general.rus_constants import st
-from streamlit.elements import WIDGETS as ST_WIDGETS
-from utilities_language_bert.rus_sentence_bert import TASK
-from utilities_language_general.rus_constants import load_bert
-from utilities_language_general.rus_utils import prepare_tasks
-from utilities_language_bert.rus_sentence_bert import SENTENCE
-import utilities_language_general.rus_constants as esp_constants
-from utilities_language_general.rus_utils import prepare_target_words
-from utilities_language_general.rus_utils import compute_frequency_dict
 from streamlit.runtime.uploaded_file_manager import UploadedFile
-from …
+from utilities_language_bert.rus_sentence_bert import TASK, SENTENCE
+from utilities_language_general.rus_utils import compute_frequency_dict, prepare_tasks, prepare_target_words
+from utilities_language_general.rus_constants import st, load_bert, load_classifiers, nlp, summarization, BAD_USER_TARGET_WORDS, MINIMUM_SETS
 
 
-def main_workflow_bert(
-file: UploadedFile,
+def main_workflow(
+file: Union[UploadedFile, None],
 text: str,
-logs…
-… (2 lines not rendered)
-progress_s: st_progress,
+logs,
+progress,
+progress_d,
 level: str,
 tw_mode_automatic_mode: str,
 target_words: str,
@@ -31,13 +23,15 @@ def main_workflow_bert(
 global_bad_target_words=BAD_USER_TARGET_WORDS):
 
 # Clear bad target_words each time
-global_bad_target_words…
+if global_bad_target_words:
+global_bad_target_words = []
 
 # Define main global variables
-logs.write()
 GLOBAL_DISTRACTORS = set()
 MAX_FREQUENCY = 0
 
+logs.update(label='Загружаем языковые модели и другие данные', state='running')
+pos_dict, scaler, classifier = load_classifiers('model3')
 mask_filler = load_bert()
 
 # Get input text
@@ -47,15 +41,15 @@ def main_workflow_bert(
 elif text != '':
 current_text = text
 else:
-…
+st.warning('Вы ни текст не вставили, ни файл не выбрали 😢')
 current_text = ''
-…
+st.stop()
 
 # Process target words
 if tw_mode_automatic_mode == 'Самостоятельно':
 if target_words == '':
-… (2 lines not rendered)
+st.warning('Вы не ввели целевые слова')
+st.stop()
 # Cannot make up paradigm, so only USER_TARGET_WORDS is used
 USER_TARGET_WORDS = prepare_target_words(target_words)
 tw_mode_automatic_mode = False
@@ -68,8 +62,8 @@ def main_workflow_bert(
 current_text = current_text.replace('.', '. ').replace('. . .', '...').replace(' ', ' ').replace('…', '...') \
 .replace('…', '...').replace('—', '-').replace('\u2014', '-').replace('—', '-').replace('-\n', '') \
 .replace('\n', '%^&*')
-current_text_sentences = [sent.text.strip() for sent in …
-logs.…
+current_text_sentences = [sent.text.strip() for sent in nlp(current_text).sents]
+logs.update(label='Получили Ваш текст!', state='running')
 progress.progress(10)
 
 # Compute frequency dict
@@ -81,31 +75,12 @@ def main_workflow_bert(
 if j < len(FREQ_DICT) * _frequency_barrier_percent:
 MAX_FREQUENCY = tp[1]
 MAX_FREQUENCY = 3 if MAX_FREQUENCY < 3 else MAX_FREQUENCY
-logs.…
+logs.update(label="Посчитали немного статистики!", state='running')
 progress.progress(15)
 
 # Choose necessary language minimum according to user's input
-if level == 'A1':
-target_minimum = esp_constants.a1_target_set
-distractor_minimum = esp_constants.a1_distractor_set
-elif level == 'A2':
-target_minimum = esp_constants.a2_target_set
-distractor_minimum = esp_constants.a2_distractor_set
-elif level == 'B1':
-target_minimum = esp_constants.b1_target_set
-distractor_minimum = esp_constants.b1_distractor_set
-elif level == 'B2':
-target_minimum = esp_constants.b2_target_set
-distractor_minimum = esp_constants.b2_distractor_set
-elif level == 'C1':
-target_minimum = esp_constants.c1_target_set
-distractor_minimum = esp_constants.c1_distractor_set
-elif level == 'C2':
-target_minimum = esp_constants.c2_target_set
-distractor_minimum = esp_constants.c2_distractor_set
-elif level == 'Без уровня':
-target_minimum = None
-distractor_minimum = None
+if level:
+target_minimum, distractor_minimum = MINIMUM_SETS[level]
 else:
 target_minimum = None
 distractor_minimum = None
@@ -115,24 +90,41 @@ def main_workflow_bert(
 # Start generation process
 workflow = [SENTENCE(original=sent.strip(), n_sentence=num, max_num_distractors=num_distractors)
 for num, sent in enumerate(current_text_sentences)]
-logs.…
+logs.update(label="Запускаем процесс генерации заданий!", state='running')
 progress.progress(20)
 
+# Define summary length
+text_length = len(current_text_sentences)
+if text_length <= 15:
+summary_length = text_length
+elif text_length <= 25:
+summary_length = 15
+else:
+n = (text_length - 20) // 5
+summary_length = 15 + 2 * n
+round_summary_length = summary_length - (summary_length % - 10)
+
+# Get summary. May choose between round_summary_length and summary_length
+SUMMARY = summarization(current_text, num_sentences=round_summary_length)
+logs.update('Нашли интересные предложения. Пригодятся!')
+progress.progress(25)
+
 for sentence in workflow:
 sentence.lemmatize_sentence()
 
 for sentence in workflow:
 sentence.bind_phrases()
-logs.…
+logs.update(label="Подготовили предложения для дальнейшей работы!", state='running')
 progress.progress(30)
 
 for j, sentence in enumerate(workflow):
 sentence.search_target_words(target_words_automatic_mode=tw_mode_automatic_mode,
 target_minimum=target_minimum,
 user_target_words=USER_TARGET_WORDS,
-frequency_dict=FREQ_DICT)
+frequency_dict=FREQ_DICT,
+summary=SUMMARY)
 progress.progress(int(30 + (j * (20 / len(workflow)))))
-…
+progress.progress(50)
 DUPLICATE_TARGET_WORDS = defaultdict(list)
 for sentence in workflow:
 for target_word in sentence.target_words:
@@ -145,8 +137,8 @@ def main_workflow_bert(
 if target_word not in RESULT_TW:
 global_bad_target_words.append(target_word['original_text'])
 sentence.target_words.remove(target_word)
-…
-logs.…
+progress.progress(55)
+logs.update(label='Выбрали слова-пропуски!', state='running')
 
 for sentence in workflow:
 for i, target_word in enumerate(sentence.target_words):
@@ -157,7 +149,7 @@ def main_workflow_bert(
 
 for sentence in workflow:
 sentence.filter_target_words(target_words_automatic_mode=tw_mode_automatic_mode)
-…
+progress.progress(60)
 
 RESULT_TASKS = []
 for sentence in workflow:
@@ -166,25 +158,29 @@ def main_workflow_bert(
 RESULT_TASKS.append(task)
 
 for num, task in enumerate(RESULT_TASKS):
-task.attach_distractors_to_target_word(model=mask_filler, …
+task.attach_distractors_to_target_word(model=mask_filler,
+scaler=scaler,
+classifier=classifier,
+pos_dict=pos_dict,
+level_name=level,
 global_distractors=GLOBAL_DISTRACTORS,
 distractor_minimum=distractor_minimum,
 max_frequency=MAX_FREQUENCY)
-… (5 lines not rendered)
-logs.…
+progress_d.progress(num / len(RESULT_TASKS))
+logs.update(label=f'Обработали {num}/{len(RESULT_TASKS)} целевых слов!', state='running')
+logs.update(label=f'Обработали {len(RESULT_TASKS)}/{len(RESULT_TASKS)} целевых слов!', state='running')
+progress_d.progress(100)
+progress.progress(70)
+logs.update(label='Подобрали неправильные варианты!', state='running')
 
 for task in RESULT_TASKS:
 task.inflect_distractors()
-…
-logs.…
+progress.progress(80)
+logs.update(label='Просклоняли и проспрягали неправильные варианты!', state='running')
 
 for task in RESULT_TASKS:
 task.sample_distractors(num_distractors=num_distractors)
-…
+progress.progress(85)
 RESULT_TASKS = list(filter(lambda t: not t.bad_target_word, RESULT_TASKS))
 
 for task in RESULT_TASKS[::-1]:
@@ -202,13 +198,18 @@ def main_workflow_bert(
 NUMBER_TASKS = 10
 else:
 NUMBER_TASKS = len(RESULT_TASKS)
-…
+RESULT_TASKS_in_summary = list(filter(lambda task: task.in_summary, RESULT_TASKS))
+RESULT_TASTS_not_in_summary = list(filter(lambda task: not task.in_summary, RESULT_TASKS))
+if len(RESULT_TASKS_in_summary) >= NUMBER_TASKS:
+RESULT_TASKS = RESULT_TASKS_in_summary
+else:
+RESULT_TASKS = RESULT_TASKS_in_summary + sample(RESULT_TASTS_not_in_summary, NUMBER_TASKS - len(RESULT_TASKS_in_summary))
 RESULT_TASKS = sorted(RESULT_TASKS, key=lambda t: (t.sentence_number, t.position_in_sentence))
 
 for task in RESULT_TASKS:
 task.compile_task(max_num_distractors=num_distractors)
-…
-logs.…
+progress.progress(90)
+logs.update(label='Отобрали лучшие задания!', state='running')
 
 TEXT_WITH_GAPS = []
 VARIANTS = []
@@ -222,9 +223,6 @@ def main_workflow_bert(
 TEXT_WITH_GAPS.append(sentence)
 del RESULT_TASKS
 
-logs.success('Сейчас все будет готово!')
-progress_s.progress(90)
-
 TEXT_WITH_GAPS = ' '.join([sentence for sentence in TEXT_WITH_GAPS]).replace('%^&*', '\n')
 PREPARED_TASKS = prepare_tasks(VARIANTS)
 STUDENT_OUT = f'{TEXT_WITH_GAPS}\n\n{"=" * 70}\n\n{PREPARED_TASKS["TASKS_STUDENT"]}'
@@ -232,8 +230,8 @@ def main_workflow_bert(
 f'{PREPARED_TASKS["KEYS_ONLY"]}'
 TOTAL_OUT = f'{original_text}\n\n{"$" * 70}\n\n{STUDENT_OUT}\n\n{"=" * 70}\n\n{PREPARED_TASKS["TASKS_TEACHER"]}' \
 f'\n\n{"$" * 70}\n\n{PREPARED_TASKS["KEYS_ONLY"]}'
-logs.…
-…
+logs.update(label='Сейчас все будет готово!', state='running')
+progress.progress(95)
 save_name = save_name if save_name != '' else f'{str(datetime.datetime.now())[:-7]}_{original_text[:20]}'
 out = {
 'name': save_name,
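Note: the summary-length rule added above relies on Python's divisor-signed modulo to round up to the next multiple of 10. A worked example with an illustrative sentence count:

text_length = 32                    # sentences in the input text
n = (text_length - 20) // 5         # -> 2
summary_length = 15 + 2 * n         # -> 19
# 19 % -10 == -1 (Python's modulo takes the divisor's sign), so subtracting
# it rounds up: round_summary_length == 20
round_summary_length = summary_length - (summary_length % -10)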
utilities_language_bert/rus_sentence_bert.py
CHANGED

@@ -1,15 +1,7 @@
-import copy
 import string
-from random import random
-from …
-from utilities_language_general.…
-from utilities_language_general.rus_utils import get_tags
-from utilities_language_general.rus_constants import PHRASES
-from utilities_language_general.rus_utils import define_gender
-from utilities_language_general.rus_utils import make_inflection
-from utilities_language_general.rus_utils import check_token_bert
-from utilities_language_general.rus_constants import BAD_USER_TARGET_WORDS
-from utilities_language_general.rus_utils import get_distractors_from_model_bert
+from random import random, sample
+from utilities_language_general.rus_constants import nlp, PHRASES, BAD_USER_TARGET_WORDS
+from utilities_language_general.rus_utils import get_tags, define_gender, make_inflection, check_token_bert, get_distractors_from_model_bert
 
 
 class SENTENCE:
@@ -47,11 +39,10 @@
 if not previous_was_phrase:
 self.sentence_phrases.append(self.sentence_lemma_pos[i][1])
 previous_was_phrase = False
+self.sentence_phrases.append(self.sentence_lemma_pos[-1][1])
 
-def search_target_words_automatically(self, target_minimum: set, frequency_dict: dict = None):
+def search_target_words_automatically(self, target_minimum: set, frequency_dict: dict = None, summary: list=None):
 for token in self.sentence_phrases:
-# TODO: Still do not have w2v model with phrases
-# therefore cannot come up with the criteria
 if isinstance(token, list):  # if token is a phrase
 original_token1 = token[1]['original_token1']
 original_token2 = token[1]['original_token2']
@@ -71,7 +62,8 @@
 'tags': tags,
 'position_in_sentence': self.original.find(original_token1.text),
 'not_named_entity': not_ner,
-'frequency_in_text': 0
+'frequency_in_text': 0,
+'in_summary': self.original in summary
 }
 self.target_words.append(target_word)
 else:  # if token is just a spacy.nlp token
@@ -89,10 +81,11 @@
 'position_in_sentence': self.original.find(token.text),
 'not_named_entity': True if token.ent_type == 0 else False,
 'frequency_in_text': frequency_dict.get(token.lemma_, 1),
+'in_summary': self.original in summary
 }
 self.target_words.append(target_word)
 
-def search_user_target_words(self, user_target_words: set = None, frequency_dict: dict = None):
+def search_user_target_words(self, user_target_words: set = None, frequency_dict: dict = None, summary: list=None):
 for _utw in user_target_words:
 if _utw in self.original:
 parse_utw = nlp(_utw)
@@ -118,19 +111,20 @@
 'tags': user_target_word_tags,
 'position_in_sentence': self.original.find(_utw),
 'not_named_entity': not_ner,
-'frequency_in_text': frequency_dict.get(user_target_word_lemma, 1)
+'frequency_in_text': frequency_dict.get(user_target_word_lemma, 1),
+'in_summary': self.original in summary
 }
 self.target_words.append(target_word)
 
 def search_target_words(self, target_words_automatic_mode: bool, target_minimum,
 user_target_words: set = None,
-frequency_dict: dict = None):
+frequency_dict: dict = None, summary: list=None):
 if target_words_automatic_mode:
 self.search_target_words_automatically(target_minimum=target_minimum,
-frequency_dict=frequency_dict)
+frequency_dict=frequency_dict, summary=summary)
 else:
 self.search_user_target_words(user_target_words=user_target_words,
-frequency_dict=frequency_dict)
+frequency_dict=frequency_dict, summary=summary)
 
 def filter_target_words(self, target_words_automatic_mode):
 c_position = 0
@@ -170,17 +164,11 @@
 def __repr__(self):
 return '\n'.join([f'{key}\t=\t{value}' for key, value in self.__dict__.items()])
 
-def attach_distractors_to_target_word(self, model, …
-level_name, max_frequency):
+def attach_distractors_to_target_word(self, model, scaler, classifier, pos_dict,
+global_distractors, distractor_minimum, level_name, max_frequency):
 pos = self.pos[0] if self.pos[0] == 'phrase' else self.pos[1]
-… (2 lines not rendered)
-# text_with_masked_task=self.text_with_masked_task,
-# global_distractors=global_distractors,
-# distractor_minimum=distractor_minimum,
-# max_num_distractors=self.max_num_distractors)
-distractors_sentence = get_distractors_from_model_bert(model=model, lemma=self.lemma, pos=pos,
-gender=self.gender, level_name=level_name,
+distractors_sentence = get_distractors_from_model_bert(model=model, scaler=scaler, classifier=classifier, pos_dict=pos_dict,
+level_name=level_name, lemma=self.lemma, pos=pos, gender=self.gender,
 text_with_masked_task=self.masked_sentence,
 global_distractors=global_distractors,
 distractor_minimum=distractor_minimum,
@@ -201,17 +189,6 @@
 inflected = make_inflection(text=distractor_lemma, pos=self.pos[1], tags=self.tags)
 if inflected is not None:
 inflected_distractors.append(inflected)
-else:
-new_tags = copy.deepcopy(self.tags)
-if 'NOUN' in self.tags and 'inan' in self.tags:
-new_tags.discard('inan')
-new_tags.add('anim')
-elif 'NOUN' in self.tags and 'anim' in self.tags:
-new_tags.discard('anim')
-new_tags.add('inan')
-inflected = make_inflection(text=distractor_lemma, pos=self.pos[1], tags=new_tags)
-if inflected is not None:
-inflected_distractors.append(inflected)
 num_distractors = min(4, self.max_num_distractors) if self.max_num_distractors >= 4 \
 else self.max_num_distractors
 if len(inflected_distractors) < num_distractors:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
text_with_masked_task=self.masked_sentence,
|
| 173 |
global_distractors=global_distractors,
|
| 174 |
distractor_minimum=distractor_minimum,
|
|
|
|
| 189 |
inflected = make_inflection(text=distractor_lemma, pos=self.pos[1], tags=self.tags)
|
| 190 |
if inflected is not None:
|
| 191 |
inflected_distractors.append(inflected)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 192 |
num_distractors = min(4, self.max_num_distractors) if self.max_num_distractors >= 4 \
|
| 193 |
else self.max_num_distractors
|
| 194 |
if len(inflected_distractors) < num_distractors:
|
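A note on the new `summary` parameter threaded through the methods above: it is annotated `list`, and `'in_summary': self.original in summary` relies on Python's polymorphic `in` operator. A small sketch of the two possible behaviours (illustrative values, not commit code):

    sentence = 'Это важное предложение.'
    # if the summarizer yields a list of sentences, this is exact membership:
    sentence in ['Это важное предложение.', 'Другое предложение.']    # True
    # if it yields one summary string, the same expression is a substring test:
    sentence in 'Это важное предложение. Другое предложение.'         # also True

Either way the flag marks sentences that made it into the extractive summary, which the workflow modules below use to prioritise tasks.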
utilities_language_general/rus_constants.py
CHANGED
@@ -3,7 +3,9 @@ import spacy
   3   import gensim
   4   import pymorphy2
   5   import streamlit as st
   6   from transformers import pipeline
   7
   8
   9   @st.cache_resource

@@ -13,24 +15,53 @@ def load_morph():
  13
  14
  15   @st.cache_resource
  16 - def load_w2v(
  17 - …
  18 - …
  19
  20
  21   @st.cache_resource
  22   def load_spacy():
  23 -     …
  24       return _nlp
  25
  26
  27   @st.cache_resource
  28   def load_bert():
  29 -     …
  30
  31
  32   nlp = load_spacy()
  33   morph = load_morph()
  34   w2v_model1_path = r'model1.gz'
  35   w2v_model2_path = r'model2.gz'
  36

@@ -47,24 +78,49 @@ b1_path, b1_target_set = r'language_data/B1_MINIMUM.txt', set()
  47   b2_path, b2_target_set = r'language_data/B2_MINIMUM.txt', set()
  48   c1_path, c1_target_set = r'language_data/C1_MINIMUM.txt', set()
  49   c2_path, c2_target_set = r'language_data/C2_MINIMUM.txt', set()
  50 - …
  51   minimums_sets = (a1_target_set, a2_target_set, b1_target_set, b2_target_set, c1_target_set, c2_target_set)
  52   for i in range(len(minimums_paths)):
  53       with open(minimums_paths[i], 'r', encoding='utf-8') as read_file:
  54           for line in read_file:
  55               minimums_sets[i].add(line.strip())
  56
  57 - …
  58 - …
  59 - …
  60 - …
  61 - …
  62 - …
  63
  64   with open('language_data/phrases.json', 'r', encoding='utf-8') as f:
  65       PHRASES = set(json.load(f)['PHRASES'])
  66
  67 - SIMILARITY_VALUES_w2v = {'A1': 1.0, 'A2': 1.0, 'B1': 1.0, 'B2': 1.0, 'C1': 1.0, 'C2': 1.0, 'Без уровня': 1.0}
  68 - SIMILARITY_VALUES_bert = {'A1': 1.0, 'A2': 1.0, 'B1': 1.0, 'B2': 1.0, 'C1': 1.0, 'C2': 1.0, 'Без уровня': 1.0}
  69 -
  70   BAD_USER_TARGET_WORDS = []

   3   import gensim
   4   import pymorphy2
   5   import streamlit as st
   6 + from pickle import load
   7   from transformers import pipeline
   8 + from summarizer import Summarizer
   9
  10
  11   @st.cache_resource

  15
  16
  17   @st.cache_resource
  18 + def load_w2v(model):
  19 +     with st.spinner('Загружаю языковую модель'):
  20 +         if model == 'model1':
  21 +             model_path = r'language_data/model1.gz'
  22 +         else:
  23 +             model_path = r'language_data/model2.gz'
  24 +         return gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
  25
  26
  27   @st.cache_resource
  28   def load_spacy():
  29 +     with st.spinner('Загружаю морфо-синтаксический парсер'):
  30 +         _nlp = spacy.load('ru_core_news_lg')
  31       return _nlp
  32
  33
  34   @st.cache_resource
  35   def load_bert():
  36 +     with st.spinner('Загружаю языковую модель'):
  37 +         _pipeline = pipeline(task="fill-mask", model="a-v-white/bert-base-spanish-wwm-cased-finetuned-literature-pro")
  38 +     return _pipeline
  39
  40
  41 + @st.cache_resource
  42 + def load_summarizer():
  43 +     return Summarizer()
  44 +
  45 + @st.cache_resource
  46 + def load_classifiers(model):
  47 +     if model == 'model1':
  48 +         scaler_path = 'language_data/model1_no_wn_minmaxscaler.pickle'
  49 +         classifier_path = 'language_data/model1_no_wn_catboost_classifier.pickle'
  50 +     elif model == 'model2':
  51 +         scaler_path = 'language_data/model2_no_wn_minmaxscaler.pickle'
  52 +         classifier_path = 'language_data/model2_no_wn_catboost_classifier.pickle'
  53 +     else:
  54 +         scaler_path = 'language_data/model3_no_wn_minmaxscaler.pickle'
  55 +         classifier_path = 'language_data/model3_no_wn_catboost_classifier.pickle'
  56 +     with (open(scaler_path, 'rb') as f1, open(classifier_path, 'rb') as f2, open('language_data/pos_dict.pickle', 'rb') as f3):
  57 +         scaler = load(f1)
  58 +         classifier = load(f2)
  59 +         pos_dict = load(f3)
  60 +     return pos_dict, scaler, classifier
  61 +
  62   nlp = load_spacy()
  63   morph = load_morph()
  64 + summarization = load_summarizer()
  65   w2v_model1_path = r'model1.gz'
  66   w2v_model2_path = r'model2.gz'
  67

  78   b2_path, b2_target_set = r'language_data/B2_MINIMUM.txt', set()
  79   c1_path, c1_target_set = r'language_data/C1_MINIMUM.txt', set()
  80   c2_path, c2_target_set = r'language_data/C2_MINIMUM.txt', set()
  81 +
  82 + minimums_paths = (a1_path, a2_path, b1_path, b2_path, c1_path, c2_path)
  83   minimums_sets = (a1_target_set, a2_target_set, b1_target_set, b2_target_set, c1_target_set, c2_target_set)
  84   for i in range(len(minimums_paths)):
  85       with open(minimums_paths[i], 'r', encoding='utf-8') as read_file:
  86           for line in read_file:
  87               minimums_sets[i].add(line.strip())
  88
  89 + MINIMUM_SETS = {
  90 +     'A1': (a1_target_set, a1_target_set),
  91 +     'A2': (a2_target_set, a2_target_set.union(a1_target_set)),
  92 +     'B1': (b1_target_set, b1_target_set.union(a2_target_set)),
  93 +     'B2': (b2_target_set, b2_target_set.union(b1_target_set)),
  94 +     'C1': (c1_target_set, c1_target_set.union(b2_target_set)),
  95 +     'C2': (c2_target_set, c2_target_set.union(c1_target_set)),
  96 +     'Без уровня': (None, None)
  97 + }
  98 +
  99 + LEVEL_NUMBERS = {'A1': 1, 'A2': 2, 'B1': 3, 'B2': 4, 'C1': 5, 'C2': 6}
 100
 101   with open('language_data/phrases.json', 'r', encoding='utf-8') as f:
 102       PHRASES = set(json.load(f)['PHRASES'])
 103
 104   BAD_USER_TARGET_WORDS = []
 105 +
 106 +
 107 + COMBINE_POS = {
 108 +     'simple':
 109 +         {
 110 +             'A1': {'VERB': ['AUX']},
 111 +             'A2': {'VERB': ['AUX']},
 112 +             'B1': {'VERB': ['AUX']},
 113 +             'B2': {'VERB': ['AUX']},
 114 +             'C1': {'VERB': ['AUX']},
 115 +             'C2': {'VERB': ['AUX']},
 116 +         },
 117 +     'phrase':
 118 +         {
 119 +             'A1': {'VERB': ['AUX']},
 120 +             'A2': {'VERB': ['AUX']},
 121 +             'B1': {'VERB': ['AUX']},
 122 +             'B2': {'VERB': ['AUX']},
 123 +             'C1': {'VERB': ['AUX']},
 124 +             'C2': {'VERB': ['AUX']},
 125 +         },
 126 + }
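The new `MINIMUM_SETS` table packs, for every CEFR level, a pair (target minimum, distractor minimum), where the distractor minimum is the union of the level's lexical minimum with the neighbouring level's one. Callers can then replace a long if/elif ladder with a single lookup; a sketch of the intended usage (mirroring how the workflow modules below consume it):

    from utilities_language_general.rus_constants import MINIMUM_SETS

    level = 'B1'
    target_minimum, distractor_minimum = MINIMUM_SETS[level]
    # for 'B1' the distractor minimum is b1_target_set.union(a2_target_set);
    # for 'Без уровня' both values are None and must be handled by the caller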
utilities_language_general/rus_utils.py
CHANGED
@@ -1,10 +1,8 @@
   1   from nltk import edit_distance
   2   from utilities.utils import answer_letter
   3 - from utilities_language_general.…
   4 - from utilities_language_general.rus_constants import morph
   5 - from utilities_language_general.rus_constants import stop_list
   6 - from utilities_language_general.rus_constants import SIMILARITY_VALUES_w2v
   7 - from utilities_language_general.rus_constants import SIMILARITY_VALUES_bert
   8
   9
  10   def prepare_target_words(target_words):

@@ -73,7 +71,7 @@ def get_tags(token: str):
  73       return set(parts), method
  74
  75
  76 - def make_inflection(text: str, pos: str or list, tags: set) -> str or None:
  77       if isinstance(pos, list):
  78           if set(pos).issubset({'NOUN', 'ADJ', 'PROPN'}):
  79               noun_adjective_phrase_tags = {'nomn', 'gent', 'datv', 'accs', 'ablt', 'loct', 'voct',

@@ -101,6 +99,75 @@ def make_inflection(text: str, pos: str or list, tags: set) -> str or None:
 101               return None
 102       else:
 103           word_form = morph.parse(text)[0].inflect(tags)
 104           return word_form.word if word_form is not None else None
 105
 106

@@ -184,8 +251,8 @@ def check_token_bert(token, current_minimum: set = None, stop_words=stop_list,
 184           return False
 185
 186
 187 - def get_distractors_from_model(model, lemma: str, pos: str, gender: str…
 188 -                                distractor_minimum: set, level_name: str, max_num_distractors: int,
 189                                  max_length_ratio=5, min_edit_distance_ratio=0.5):
 190       distractors = []
 191       query = lemma if '_' in lemma else f'{lemma}_{pos}'

@@ -205,12 +272,16 @@ def get_distractors_from_model(model, lemma: str, pos: str, gender: str or None,
 205           distractor_similarity = candidate[1]
 206           candidate_gender = define_gender(distractor_lemma)
 207           length_ratio = abs(len(lemma) - len(distractor_lemma))
 208           condition = ((distractor_pos == pos
 209 -                      or (…
 210                        and distractor_lemma != lemma
 211                        and len(distractors) < 100
 212 -                      and …
 213 -                      and candidate_gender == gender
 214                        and length_ratio <= max_length_ratio
 215                        and distractor_lemma not in global_distractors
 216                        and edit_distance(lemma, distractor_lemma) / ((len(lemma) + len(distractor_lemma)) / 2) >

@@ -228,16 +299,22 @@ def get_distractors_from_model(model, lemma: str, pos: str, gender: str or None,
 228                   continue
 229               d1_lemma, d1_pos, d2_lemma, d2_pos = candidate[0].split('_')
 230               distractor_lemma = f'{d1_lemma}_{d2_lemma}'
 231               distractor_similarity = candidate[1]
 232               condition = (((d1_pos == pos or d2_pos == pos)
 233                             or (d1_pos in ('VERB', 'AUX', 'SCONJ', 'ADP')
 234                                 and pos in ('phrase', 'VERB', 'AUX', 'SCONJ', 'ADP'))
 235                             or (d2_pos in ('VERB', 'AUX', 'SCONJ', 'ADP')
 236                                 and pos in ('phrase', 'VERB', 'AUX', 'SCONJ', 'ADP')))
 237                            and candidate[0] != lemma
 238                            and distractor_lemma != lemma
 239                            and len(distractors) < 100
 240 -                         and distractor_similarity < SIMILARITY_VALUES_w2v[level_name]
 241                            and distractor_lemma not in global_distractors)
 242               if condition:
 243                   if distractor_minimum is not None:

@@ -255,8 +332,8 @@ def get_distractors_from_model(model, lemma: str, pos: str, gender: str or None,
 255       return None
 256
 257
 258 - def get_distractors_from_model_bert(model,…
 259 -                                     global_distractors: set, distractor_minimum: set,
 260                                       max_num_distractors: int, max_length_ratio=5, min_edit_distance_ratio=0.5):
 261       _distractors = []
 262       try:

@@ -283,12 +360,16 @@ def get_distractors_from_model_bert(model, text_with_masked_task: str, lemma: st
 283           distractor_similarity = candidate_distractor[1]
 284           candidate_gender = define_gender(distractor_lemma)
 285           length_ratio = abs(len(lemma) - len(distractor_lemma))
 286           if (((distractor_pos == pos)
 287 -             or (…
 288              and distractor_lemma != lemma
 289              and (len(_distractors) < max_num_distractors + 10)
 290 -            and (…
 291 -            and (candidate_gender == gender)
 292              and (length_ratio <= max_length_ratio)  # May be changed if case of phrases
 293              and (distractor_lemma not in global_distractors)
 294              and (edit_distance(lemma, distractor_lemma)  # May be changed if case of phrases

   1 + from random import randint
   2   from nltk import edit_distance
   3   from utilities.utils import answer_letter
   4 + from utilities_language_general.similarity_measures import make_decision
   5 + from utilities_language_general.rus_constants import nlp, morph, stop_list, COMBINE_POS
   6
   7
   8   def prepare_target_words(target_words):

  71       return set(parts), method
  72
  73
  74 + def make_inflection(text: str, pos: str or list, tags: set, level: str) -> str or None:
  75       if isinstance(pos, list):
  76           if set(pos).issubset({'NOUN', 'ADJ', 'PROPN'}):
  77               noun_adjective_phrase_tags = {'nomn', 'gent', 'datv', 'accs', 'ablt', 'loct', 'voct',

  99               return None
 100       else:
 101           word_form = morph.parse(text)[0].inflect(tags)
 102 +         rnd = randint(0,5)
 103 +         if pos == 'VERB' and 'impf' in tags and level in ('A1', 'A2'):
 104 +             tags.discard('impf')
 105 +             tags.add('perf')
 106 +             word_form = morph.parse(text)[0].inflect(tags)
 107 +             if word_form is not None:
 108 +                 return word_form.word
 109 +         elif pos == 'VERB' and 'perf' in tags and level in ('A1', 'A2'):
 110 +             tags.discard('perf')
 111 +             tags.add('impf')
 112 +             word_form = morph.parse(text)[0].inflect(tags)
 113 +             if word_form is not None:
 114 +                 return word_form.word
 115 +         if pos == 'NOUN' and level in ('A1', 'A2'):
 116 +             if word_form is None:
 117 +                 if 'inan' in tags:
 118 +                     tags.discard('inan')
 119 +                     tags.add('anim')
 120 +                 elif 'anim' in tags:
 121 +                     tags.discard('anim')
 122 +                     tags.add('inan')
 123 +         if pos in ('NOUN', 'ADJ') and level in ('A1, A2') and rnd == 0:
 124 +             if 'sing' in tags:
 125 +                 tags.discard('sing')
 126 +                 tags.add('plur')
 127 +             elif 'plur' in tags:
 128 +                 tags.discard('plur')
 129 +                 tags.add('sing')
 130 +         elif pos in ('NOUN', 'ADJ') and level in ('A1', 'A2') and rnd == 1:
 131 +             if 'masc' in tags:
 132 +                 tags.remove('masc')
 133 +                 tags.add('femn')
 134 +                 word_form = morph.parse(text)[0].inflect(tags)
 135 +                 if word_form is not None:
 136 +                     return word_form.word
 137 +                 else:
 138 +                     tags.remove('femn')
 139 +                     tags.add('neut')
 140 +                     word_form = morph.parse(text)[0].inflect(tags)
 141 +                     if word_form is not None:
 142 +                         return word_form.word
 143 +             if 'femn' in tags:
 144 +                 tags.remove('femn')
 145 +                 tags.add('masc')
 146 +                 word_form = morph.parse(text)[0].inflect(tags)
 147 +                 if word_form is not None:
 148 +                     return word_form.word
 149 +                 else:
 150 +                     tags.remove('masc')
 151 +                     tags.add('neut')
 152 +                     word_form = morph.parse(text)[0].inflect(tags)
 153 +                     if word_form is not None:
 154 +                         return word_form.word
 155 +             if 'neut' in tags:
 156 +                 tags.remove('neut')
 157 +                 tags.add('masc')
 158 +                 word_form = morph.parse(text)[0].inflect(tags)
 159 +                 if word_form is not None:
 160 +                     return word_form.word
 161 +                 else:
 162 +                     tags.remove('masc')
 163 +                     tags.add('femn')
 164 +                     word_form = morph.parse(text)[0].inflect(tags)
 165 +                     if word_form is not None:
 166 +                         return word_form.word
 167 +         else:
 168 +             word_form = morph.parse(text)[0].inflect(tags)
 169 +             return word_form.word if word_form is not None else None
 170 +         word_form = morph.parse(text)[0].inflect(tags)
 171       return word_form.word if word_form is not None else None
 172
 173

 251       return False
 252
 253
 254 + def get_distractors_from_model(doc, model, scaler, classifier, pos_dict:dict, target_text:str, lemma: str, pos: str, gender: str,
 255 +                                lemma_index:int, global_distractors: set, distractor_minimum: set, level_name: str, max_num_distractors: int,
 256                                  max_length_ratio=5, min_edit_distance_ratio=0.5):
 257       distractors = []
 258       query = lemma if '_' in lemma else f'{lemma}_{pos}'

 272           distractor_similarity = candidate[1]
 273           candidate_gender = define_gender(distractor_lemma)
 274           length_ratio = abs(len(lemma) - len(distractor_lemma))
 275 +         decision = make_decision(doc, model_type='w2v', model=model, scaler=scaler, classifier=classifier, pos_dict=pos_dict,
 276 +                                  level=level_name, target_lemma=query, target_text=target_text, target_pos=pos, target_position=lemma_index,
 277 +                                  substitute_lemma=distractor_lemma, substitute_pos=distractor_pos)
 278           condition = ((distractor_pos == pos
 279 +                      or (COMBINE_POS['simple'][level_name].get(pos) is not None and COMBINE_POS['simple'][level_name].get(distractor_pos) is not None
 280 +                          and distractor_pos in COMBINE_POS['simple'][level_name][pos] and pos in COMBINE_POS['simple'][level_name][distractor_pos]))
 281 +                     and decision
 282                       and distractor_lemma != lemma
 283                       and len(distractors) < 100
 284 +                     and (candidate_gender == gender and level_name in ('B1', 'B2', 'C1', 'C2'))
 285                       and length_ratio <= max_length_ratio
 286                       and distractor_lemma not in global_distractors
 287                       and edit_distance(lemma, distractor_lemma) / ((len(lemma) + len(distractor_lemma)) / 2) >

 299                   continue
 300               d1_lemma, d1_pos, d2_lemma, d2_pos = candidate[0].split('_')
 301               distractor_lemma = f'{d1_lemma}_{d2_lemma}'
 302 +             d_pos = f'{d1_pos}_{d2_pos}'
 303               distractor_similarity = candidate[1]
 304 +             decision = make_decision(doc, model_type='w2v', model=model, scaler=scaler, classifier=classifier, pos_dict=pos_dict,
 305 +                                      level=level_name, target_lemma=query, target_text=target_text, target_pos=pos, target_position=lemma_index,
 306 +                                      substitute_lemma=candidate[0], substitute_pos=d_pos)
 307               condition = (((d1_pos == pos or d2_pos == pos)
 308 +                           or (COMBINE_POS['simple'][level_name].get(pos) is not None and COMBINE_POS['simple'][level_name].get(distractor_pos) is not None
 309 +                               and distractor_pos in COMBINE_POS['simple'][level_name][pos] and pos in COMBINE_POS['simple'][level_name][distractor_pos])
 310                             or (d1_pos in ('VERB', 'AUX', 'SCONJ', 'ADP')
 311                                 and pos in ('phrase', 'VERB', 'AUX', 'SCONJ', 'ADP'))
 312                             or (d2_pos in ('VERB', 'AUX', 'SCONJ', 'ADP')
 313                                 and pos in ('phrase', 'VERB', 'AUX', 'SCONJ', 'ADP')))
 314 +                          and decision
 315                            and candidate[0] != lemma
 316                            and distractor_lemma != lemma
 317                            and len(distractors) < 100
 318                            and distractor_lemma not in global_distractors)
 319               if condition:
 320                   if distractor_minimum is not None:

 332       return None
 333
 334
 335 + def get_distractors_from_model_bert(model, scaler, classifier, pos_dict:dict, level_name: str, lemma: str, pos: str, gender: str,
 336 +                                     text_with_masked_task: str, global_distractors: set, distractor_minimum: set,
 337                                       max_num_distractors: int, max_length_ratio=5, min_edit_distance_ratio=0.5):
 338       _distractors = []
 339       try:

 360           distractor_similarity = candidate_distractor[1]
 361           candidate_gender = define_gender(distractor_lemma)
 362           length_ratio = abs(len(lemma) - len(distractor_lemma))
 363 +         decision = make_decision(doc=None, model_type='bert', scaler=scaler, classifier=classifier, pos_dict=pos_dict, level=level_name,
 364 +                                  target_lemma=lemma, target_text=None, target_pos=pos, target_position=None,
 365 +                                  substitute_lemma=distractor_lemma, substitute_pos=distractor_pos, bert_score=distractor_similarity)
 366           if (((distractor_pos == pos)
 367 +             or (COMBINE_POS['phrase'][level_name].get(pos) is not None and COMBINE_POS['phrase'][level_name].get(distractor_pos) is not None
 368 +                 and distractor_pos in COMBINE_POS['phrase'][level_name][pos] and pos in COMBINE_POS['phrase'][level_name][distractor_pos]))
 369 +            and decision
 370              and distractor_lemma != lemma
 371              and (len(_distractors) < max_num_distractors + 10)
 372 +            and (candidate_gender == gender and level_name in ('B1', 'B2', 'C1', 'C2'))
 373              and (length_ratio <= max_length_ratio)  # May be changed if case of phrases
 374              and (distractor_lemma not in global_distractors)
 375              and (edit_distance(lemma, distractor_lemma)  # May be changed if case of phrases
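The repeated `COMBINE_POS` condition above accepts a distractor whose part of speech differs from the target's only when, at the current level, each tag lists the other as combinable. Pulled out into a standalone predicate for readability (a sketch; the helper name is ours, the commit inlines the expression):

    def pos_compatible(pos, distractor_pos, level_name, table):
        # table is COMBINE_POS['simple'] or COMBINE_POS['phrase']
        level_table = table[level_name]
        return (level_table.get(pos) is not None
                and level_table.get(distractor_pos) is not None
                and distractor_pos in level_table[pos]
                and pos in level_table[distractor_pos])

Note that the check is symmetric: with entries like {'VERB': ['AUX']} alone it only passes once the reverse entry ('AUX': ['VERB']) is present as well.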
utilities_language_general/similarity_measures.py
ADDED
@@ -0,0 +1,255 @@
   1 + import numpy as np
   2 + from math import pow
   3 + from nltk.corpus import wordnet as wn
   4 + from utilities_language_general.rus_constants import nlp, PHRASES, LEVEL_NUMBERS
   5 +
   6 + def eucledian_distance(x, y):
   7 +     return np.sqrt(np.sum((x - y) ** 2))
   8 +
   9 + def cosine_similarity(x, y):
  10 +     out = np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))
  11 +     if str(out) != 'nan':
  12 +         return out
  13 +     return None
  14 +
  15 + def get_vector_for_token(model, token):
  16 +     vector = None
  17 +
  18 +     splitted = token.split('_')
  19 +     token_list = [f'{splitted[i]}_{splitted[i+1]}' for i in range(len(splitted)-1)]
  20 +
  21 +     if model.has_index_for(token):
  22 +         vector = model.get_vector(token)
  23 +     else:
  24 +         try:
  25 +             vector = model.get_mean_vector(token_list)
  26 +         except ValueError:
  27 +             return None
  28 +     return vector
  29 +
  30 + def compute_metric(func, vector1, vector2):
  31 +     if vector1 is not None and vector2 is not None:
  32 +         return func(vector1, vector2)
  33 +     else:
  34 +         return None
  35 +
  36 + def compute_positive_cos(x, y):
  37 +     cos_sim = cosine_similarity(x, y)
  38 +     if cos_sim:
  39 +         return (cos_sim + 1) / 2
  40 +     else:
  41 +         return None
  42 +
  43 + def addition_metric(substitute, target, context):
  44 +     substitute_target_cos = compute_metric(cosine_similarity, substitute, target)
  45 +     if not substitute_target_cos:
  46 +         return None
  47 +     if not context:
  48 +         return None
  49 +
  50 +     context_vectors = []
  51 +     for context_tk in context:
  52 +         substitute_context_cos = compute_metric(cosine_similarity, substitute, context_tk)
  53 +         if substitute_context_cos:
  54 +             context_vectors.append(substitute_context_cos)
  55 +     sum_of_context_vectors = np.sum(context_vectors)
  56 +
  57 +     metric = (substitute_target_cos + sum_of_context_vectors) / (len(context) + 1)
  58 +     return metric
  59 +
  60 + def balanced_addition_metric(substitute, target, context):
  61 +     substitute_target_cos = compute_metric(cosine_similarity, substitute, target)
  62 +     if not substitute_target_cos:
  63 +         return None
  64 +     if not context:
  65 +         return None
  66 +
  67 +     context_vectors = []
  68 +     for context_tk in context:
  69 +         substitute_context_cos = compute_metric(cosine_similarity, substitute, context_tk)
  70 +         if substitute_context_cos:
  71 +             context_vectors.append(substitute_context_cos)
  72 +     sum_of_context_vectors = np.sum(context_vectors)
  73 +
  74 +     context_len = len(context)
  75 +     metric = (context_len * substitute_target_cos + sum_of_context_vectors) / (2 * context_len)
  76 +     return metric
  77 +
  78 + def multiplication_metric(substitute, target, context):
  79 +     substitute_target_cos = compute_metric(compute_positive_cos, substitute, target)
  80 +     if not substitute_target_cos:
  81 +         return None
  82 +     if not context:
  83 +         return None
  84 +
  85 +     context_vectors = []
  86 +     for context_tk in context:
  87 +         substitute_context_positive_cos = compute_metric(compute_positive_cos, substitute, context_tk)
  88 +         if substitute_context_positive_cos:
  89 +             context_vectors.append(substitute_context_positive_cos)
  90 +     prod_of_context_vectors = np.prod(context_vectors)
  91 +     try:
  92 +         metric = pow((substitute_target_cos + prod_of_context_vectors), 1 / (len(context) + 1))
  93 +     except ValueError:
  94 +         return None
  95 +     return metric
  96 +
  97 + def balanced_multiplication_metric(substitute, target, context):
  98 +     substitute_target_cos = compute_metric(compute_positive_cos, substitute, target)
  99 +     if not substitute_target_cos:
 100 +         return None
 101 +     if not context:
 102 +         return None
 103 +
 104 +     context_vectors = []
 105 +     for context_tk in context:
 106 +         substitute_context_positive_cos = compute_metric(compute_positive_cos, substitute, context_tk)
 107 +         if substitute_context_positive_cos:
 108 +             context_vectors.append(substitute_context_positive_cos)
 109 +     prod_of_context_vectors = np.prod(context_vectors)
 110 +
 111 +     context_len = len(context)
 112 +     try:
 113 +         metric = pow((pow(substitute_target_cos, context_len) + prod_of_context_vectors), 1 / (2 * context_len))
 114 +     except ValueError:
 115 +         return None
 116 +     return metric
 117 +
 118 + def bind_phrases(context_list):
 119 +     context = []
 120 +     previous_was_phrase = False
 121 +     for i in range(len(context_list)-1):
 122 +         phrase_candidate = f'{context_list[i]}_{context_list[i+1]}'
 123 +         if phrase_candidate in PHRASES and not previous_was_phrase:
 124 +             context.append(phrase_candidate)
 125 +             previous_was_phrase = True
 126 +         else:
 127 +             if not previous_was_phrase:
 128 +                 context.append(context_list[i])
 129 +             previous_was_phrase = False
 130 +     if context_list:
 131 +         if not context:
 132 +             context.append(context_list[-1])
 133 +         elif not context_list[-1] in context[-1]:
 134 +             context.append(context_list[-1])
 135 +     return context
 136 +
 137 + def get_context_windows(doc, target_text, window_size):
 138 +     sentence_str = doc.text
 139 +     sentence_masked = sentence_str.lower().replace(target_text.lower().strip(), ' [MASK] ')
 140 +     alpha_tokens_lemma_pos = [f'{tk.lemma_.lower()}_{tk.pos_}' for tk in nlp(sentence_masked) if tk.text.isalpha()]
 141 +     alpha_tokens_lemma_pos_no_stop = [f'{tk.lemma_.lower()}_{tk.pos_}' for tk in nlp(sentence_masked) if tk.text.isalpha() and not tk.is_stop]
 142 +     try:
 143 +         mask_token_index = alpha_tokens_lemma_pos.index('mask_PROPN')
 144 +         mask_token_index_no_stop = alpha_tokens_lemma_pos_no_stop.index('mask_PROPN')
 145 +     except ValueError:
 146 +         return None
 147 +
 148 +     left_border = max(mask_token_index-window_size, 0)
 149 +     right_border = min(mask_token_index+window_size, len(alpha_tokens_lemma_pos))
 150 +     l_context = alpha_tokens_lemma_pos[left_border:mask_token_index]
 151 +     r_context = alpha_tokens_lemma_pos[mask_token_index+1:right_border+1]
 152 +
 153 +     left_border_no_stop = max(mask_token_index_no_stop-window_size, 0)
 154 +     right_border_no_stop = min(mask_token_index_no_stop+window_size, len(alpha_tokens_lemma_pos_no_stop))
 155 +     l_context_no_stop = alpha_tokens_lemma_pos_no_stop[left_border_no_stop:mask_token_index_no_stop]
 156 +     r_context_no_stop = alpha_tokens_lemma_pos_no_stop[mask_token_index_no_stop+1:right_border_no_stop+1]
 157 +     return (bind_phrases(l_context) + bind_phrases(r_context), bind_phrases(l_context_no_stop) + bind_phrases(r_context_no_stop))
 158 +
 159 + def get_context_linked_words(doc, target_position, target_text):
 160 +     answer_list = target_text.split(' ')
 161 +     context_words = []
 162 +     for tk in doc:
 163 +         if tk.text.isalpha():
 164 +             if (tk.text in answer_list and abs(target_position - tk.idx) <= sum([len(t) for t in answer_list])):
 165 +                 context_words.extend([t for t in tk.subtree if t.text.isalpha() and not t.is_stop])
 166 +                 context_words.extend([t for t in tk.children if t.text.isalpha() and not t.is_stop])
 167 +                 context_words.extend([t for t in tk.ancestors if t.text.isalpha() and not t.is_stop])
 168 +     context_words = [(tk, f'{tk.lemma_}_{tk.pos_}') for tk in sorted(set(context_words), key=lambda tk: tk.i) if tk.text not in answer_list]
 169 +     context = []
 170 +     previous_was_phrase = False
 171 +     for i in range(len(context_words)-1):
 172 +         phrase_candidate = f'{context_words[i][1]}_{context_words[i+1][1]}'
 173 +         if phrase_candidate in PHRASES and not previous_was_phrase and abs(context_words[i][0].i - context_words[i+1][0].i) <=1:
 174 +             context.append(phrase_candidate)
 175 +             previous_was_phrase = True
 176 +         else:
 177 +             if not previous_was_phrase:
 178 +                 context.append(context_words[i][1])
 179 +     if context and context_words:
 180 +         if not context_words[-1][1] in context[-1]:
 181 +             context.append(context_words[-1][1])
 182 +     elif context_words:
 183 +         context.append(context_words[-1][1])
 184 +     return context
 185 +
 186 +
 187 + def compute_all_necessary_metrics(target_lemma, target_text, target_position, substitute_lemma, doc, model_type:str, model=None):
 188 +
 189 +     target_vector = get_vector_for_token(model, target_lemma)
 190 +     substitute_vector = get_vector_for_token(model, substitute_lemma)
 191 +
 192 +     cosimilarity = compute_metric(cosine_similarity, substitute_vector, target_vector)
 193 +     eucledian_similarity = compute_metric(eucledian_distance, substitute_vector, target_vector)
 194 +
 195 +     context_window3, context_window3_no_stop = get_context_windows(doc=doc, target_text=target_text, window_size=3)
 196 +     context_window5, context_window5_no_stop = get_context_windows(doc=doc, target_text=target_text, window_size=5)
 197 +     context_window_synt = get_context_linked_words(doc, target_position, target_text)
 198 +
 199 +     context_window3 = [get_vector_for_token(model, token) for token in context_window3]
 200 +     context_window3_no_stop = [get_vector_for_token(model, token) for token in context_window3_no_stop]
 201 +     context_window5 = [get_vector_for_token(model, token) for token in context_window5]
 202 +     context_window5_no_stop = [get_vector_for_token(model, token) for token in context_window5_no_stop]
 203 +     context_window_synt = [get_vector_for_token(model, token) for token in context_window_synt]
 204 +
 205 +     add_metric_window3 = addition_metric(target_vector, substitute_vector, context_window3)
 206 +     bal_add_metric_window3 = balanced_addition_metric(target_vector, substitute_vector, context_window3)
 207 +     add_metric_window3_no_stop = addition_metric(target_vector, substitute_vector, context_window3_no_stop)
 208 +     bal_add_metric_window3_no_stop = balanced_addition_metric(target_vector, substitute_vector, context_window3_no_stop)
 209 +
 210 +     mult_metric_window3 = multiplication_metric(target_vector, substitute_vector, context_window3)
 211 +     bal_mult_metric_window3 = balanced_multiplication_metric(target_vector, substitute_vector, context_window3)
 212 +     mult_metric_window3_no_stop = multiplication_metric(target_vector, substitute_vector, context_window3_no_stop)
 213 +     bal_mult_metric_window3_no_stop = balanced_multiplication_metric(target_vector, substitute_vector, context_window3_no_stop)
 214 +
 215 +     add_metric_window5 = addition_metric(target_vector, substitute_vector, context_window5)
 216 +     bal_add_metric_window5 = balanced_addition_metric(target_vector, substitute_vector, context_window5)
 217 +     add_metric_window5_no_stop = addition_metric(target_vector, substitute_vector, context_window5_no_stop)
 218 +     bal_add_metric_window5_no_stop = balanced_addition_metric(target_vector, substitute_vector, context_window5_no_stop)
 219 +
 220 +     mult_metric_window5 = multiplication_metric(target_vector, substitute_vector, context_window5)
 221 +     bal_mult_metric_window5 = balanced_multiplication_metric(target_vector, substitute_vector, context_window5)
 222 +     mult_metric_window5_no_stop = multiplication_metric(target_vector, substitute_vector, context_window5_no_stop)
 223 +     bal_mult_metric_window5_no_stop = balanced_multiplication_metric(target_vector, substitute_vector, context_window5_no_stop)
 224 +
 225 +     add_metric_synt = addition_metric(target_vector, substitute_vector, context_window_synt)
 226 +     bal_add_metric_synt = balanced_addition_metric(target_vector, substitute_vector, context_window_synt)
 227 +
 228 +     mult_metric_synt = multiplication_metric(target_vector, substitute_vector, context_window_synt)
 229 +     bal_mult_metric_synt = balanced_multiplication_metric(target_vector, substitute_vector, context_window_synt)
 230 +
 231 +     return (cosimilarity, eucledian_similarity,
 232 +             add_metric_window3, bal_add_metric_window3,
 233 +             mult_metric_window3, bal_mult_metric_window3,
 234 +             add_metric_window3_no_stop, bal_add_metric_window3_no_stop,
 235 +             mult_metric_window3_no_stop, bal_mult_metric_window3_no_stop,
 236 +             add_metric_window5, bal_add_metric_window5,
 237 +             mult_metric_window5, bal_mult_metric_window5,
 238 +             add_metric_window5_no_stop, bal_add_metric_window5_no_stop,
 239 +             mult_metric_window5_no_stop, bal_mult_metric_window5_no_stop,
 240 +             add_metric_synt, bal_add_metric_synt,
 241 +             mult_metric_synt, bal_mult_metric_synt)
 242 +
 243 + def make_decision(doc, model_type, scaler, classifier, pos_dict, level, target_lemma, target_text, target_pos, target_position,
 244 +                   substitute_lemma, substitute_pos, model=None, bert_score=None):
 245 +     # return True
 246 +     metrics = compute_all_necessary_metrics(target_lemma=target_lemma, target_text=target_text, target_position=target_position,
 247 +                                             substitute_lemma=substitute_lemma, doc=doc, model_type=model_type, model=model)
 248 +     target_multiword, substitute_multiword = target_lemma.count('_') > 2, substitute_lemma.count('_') > 2
 249 +     data = [LEVEL_NUMBERS.get(level), pos_dict.get(target_pos), target_multiword, pos_dict.get(substitute_pos), substitute_multiword] + scaler.transform([metrics]).tolist()[0]
 250 +     if model_type == 'bert':
 251 +         data = [LEVEL_NUMBERS.get(level), pos_dict.get(target_pos), target_multiword, pos_dict.get(substitute_pos), substitute_multiword, bert_score]
 252 +     predict = classifier.predict(data)
 253 +     return bool(predict)
 254 +
 255 +
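For orientation: the context-sensitive metrics above all combine one substitute-target cosine with substitute-context cosines, and `addition_metric` simply averages them over the window. A toy numeric check with hand-made 2-d vectors (illustrative only, not commit code):

    import numpy as np

    def cos(x, y):
        return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

    substitute = np.array([0.8, 0.6])
    target = np.array([1.0, 0.0])
    context = [np.array([0.0, 1.0]), np.array([1.0, 1.0])]

    # (cos(sub, tgt) + sum of cos(sub, ctx_i)) / (len(context) + 1)
    addition = (cos(substitute, target) + sum(cos(substitute, c) for c in context)) / (len(context) + 1)
    # = (0.8 + 0.6 + 0.9899...) / 3 ≈ 0.797

`make_decision` then feeds these metrics, scaled by the pickled MinMaxScaler, together with the level number and POS codes into the CatBoost classifier and returns its boolean verdict.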
utilities_language_w2v/rus_main_workflow_w2v.py
CHANGED
@@ -1,31 +1,20 @@
   1   import datetime
   2   from io import StringIO
   3   from random import sample
   4   from collections import defaultdict
   5 - from streamlit import progress as st_progress
   6 - from streamlit.elements import WIDGETS as ST_WIDGETS
   7 - from utilities_language_general.rus_constants import st
   8 - from utilities_language_w2v.rus_sentence_w2v import TASK
   9 - from utilities_language_general.rus_constants import load_w2v
  10 - from utilities_language_general.rus_utils import prepare_tasks
  11 - from utilities_language_w2v.rus_sentence_w2v import SENTENCE
  12 - import utilities_language_general.rus_constants as esp_constants
  13 - from utilities_language_general.rus_utils import prepare_target_words
  14 - from utilities_language_general.rus_constants import w2v_model1_path
  15 - from utilities_language_general.rus_constants import w2v_model2_path
  16 - from utilities_language_general.rus_utils import compute_frequency_dict
  17   from streamlit.runtime.uploaded_file_manager import UploadedFile
  18 - from …
  19
  20
  21 - def …
  22 -         file: UploadedFile…
  23           text: str,
  24 -         logs…
  25 -         …
  26 -         …
  27 -         progress_d: st_progress,
  28 -         progress_s: st_progress,
  29           level: str,
  30           tw_mode_automatic_mode: str,
  31           target_words: str,

@@ -56,19 +45,13 @@ def main_workflow_w2v(
  56       student_out, teacher_out, total_out, original_text
  57       """
  58       # Clear bad target_words each time
  59 -     global_bad_target_words…
  60
  61       # Define main global variables
  62 -     logs.write()
  63       GLOBAL_DISTRACTORS = set()
  64       MAX_FREQUENCY = 0
  65
  66 -     # Define which model is used for distractor generation
  67 -     if model_name == 'Модель-1':
  68 -         mask_filler = load_w2v(w2v_model1_path)
  69 -     else:
  70 -         mask_filler = load_w2v(w2v_model2_path)
  71 -
  72       # Get input text
  73       if file is not None:
  74           stringio = StringIO(file.getvalue().decode("utf-8"))

@@ -76,15 +59,15 @@ def main_workflow_w2v(
  76       elif text != '':
  77           current_text = text
  78       else:
  79 -         …
  80           current_text = ''
  81 -         …
  82
  83       # Process target words
  84       if tw_mode_automatic_mode == 'Самостоятельно':
  85           if target_words == '':
  86 -             …
  87 -             …
  88           # Cannot make up paradigm, so only USER_TARGET_WORDS is used
  89           USER_TARGET_WORDS = prepare_target_words(target_words)
  90           tw_mode_automatic_mode = False

@@ -97,8 +80,8 @@ def main_workflow_w2v(
  97       current_text = current_text.replace('.', '. ').replace('. . .', '...').replace('  ', ' ').replace('…', '...') \
  98           .replace('…', '...').replace('—', '-').replace('\u2014', '-').replace('—', '-').replace('-\n', '') \
  99           .replace('\n', '%^&*')
 100 -     current_text_sentences = [sent.text.strip() for sent in …
 101 -     logs.…
 102       progress.progress(10)
 103
 104       # Compute frequency dict

@@ -110,49 +93,55 @@ def main_workflow_w2v(
 110           if j < len(FREQ_DICT) * _frequency_barrier_percent:
 111               MAX_FREQUENCY = tp[1]
 112       MAX_FREQUENCY = 3 if MAX_FREQUENCY < 3 else MAX_FREQUENCY
 113 -     logs.…
 114       progress.progress(15)
 115
 116       # Choose necessary language minimum according to user's input
 117 -     if level…
 118 -         target_minimum = …
 119 -         distractor_minimum = esp_constants.a1_distractor_set
 120 -     elif level == 'A2':
 121 -         target_minimum = esp_constants.a2_target_set
 122 -         distractor_minimum = esp_constants.a2_distractor_set
 123 -     elif level == 'B1':
 124 -         target_minimum = esp_constants.b1_target_set
 125 -         distractor_minimum = esp_constants.b1_distractor_set
 126 -     elif level == 'B2':
 127 -         target_minimum = esp_constants.b2_target_set
 128 -         distractor_minimum = esp_constants.b2_distractor_set
 129 -     elif level == 'C1':
 130 -         target_minimum = esp_constants.c1_target_set
 131 -         distractor_minimum = esp_constants.c1_distractor_set
 132 -     elif level == 'C2':
 133 -         target_minimum = esp_constants.c2_target_set
 134 -         distractor_minimum = esp_constants.c2_distractor_set
 135 -     elif level == 'Без уровня':
 136 -         target_minimum = None
 137 -         distractor_minimum = None
 138       else:
 139           target_minimum = None
 140           distractor_minimum = None
 141           logs.error('Вы не выбрали языковой уровень!')
 142           st.stop()
 143
 144       # Start generation process
 145       workflow = [SENTENCE(original=sent.strip(), n_sentence=num, max_num_distractors=num_distractors)
 146                   for num, sent in enumerate(current_text_sentences)]
 147 -     logs.…
 148       progress.progress(20)
 149
 150       for sentence in workflow:
 151           sentence.lemmatize_sentence()
 152
 153       for sentence in workflow:
 154           sentence.bind_phrases()
 155 -     logs.…
 156       progress.progress(30)
 157
 158       for j, sentence in enumerate(workflow):

@@ -160,9 +149,10 @@ def main_workflow_w2v(
 160           target_words_automatic_mode=tw_mode_automatic_mode,
 161           target_minimum=target_minimum,
 162           user_target_words=USER_TARGET_WORDS,
 163 -         frequency_dict=FREQ_DICT…
 164       progress.progress(int(30 + (j * (30 / len(workflow)))))
 165 -
 166       DUPLICATE_TARGET_WORDS = defaultdict(list)
 167       for sentence in workflow:
 168           for target_word in sentence.target_words:

@@ -175,31 +165,33 @@ def main_workflow_w2v(
 175           if target_word not in RESULT_TW:
 176               global_bad_target_words.append(target_word['original_text'])
 177               sentence.target_words.remove(target_word)
 178 -
 179 -     logs.…
 180
 181       for sentence in workflow:
 182           sentence.attach_distractors_to_target_word(model=mask_filler,
 183                                                      global_distractors=GLOBAL_DISTRACTORS,
 184                                                      distractor_minimum=distractor_minimum,
 185                                                      level_name=level,
 186                                                      max_frequency=MAX_FREQUENCY,
 187 -                                                    progress=progress_d…
 188 -                                                    …
 189 -                                                    …
 190 -     logs.success('Подобрали неправильные варианты!')
 191       for sentence in workflow:
 192 -         sentence.inflect_distractors()
 193 -         …
 194 -     logs.…
 195
 196       for sentence in workflow:
 197           sentence.filter_target_words(target_words_automatic_mode=tw_mode_automatic_mode)
 198
 199       for sentence in workflow:
 200           sentence.sample_distractors(num_distractors=num_distractors)
 201 -
 202 -     logs.…
 203
 204       RESULT_TASKS = []
 205       for sentence in workflow:

@@ -219,7 +211,12 @@ def main_workflow_w2v(
 219           NUMBER_TASKS = 10
 220       else:
 221           NUMBER_TASKS = len(RESULT_TASKS)
 222 -     …
 223       RESULT_TASKS = sorted(RESULT_TASKS, key=lambda t: (t.sentence_number, t.position_in_sentence))
 224
 225       for task in RESULT_TASKS:

@@ -243,8 +240,8 @@ def main_workflow_w2v(
 243           f'{PREPARED_TASKS["KEYS_ONLY"]}'
 244       TOTAL_OUT = f'{original_text}\n\n{"$" * 70}\n\n{STUDENT_OUT}\n\n{"=" * 70}\n\n{PREPARED_TASKS["TASKS_TEACHER"]}' \
 245           f'\n\n{"$" * 70}\n\n{PREPARED_TASKS["KEYS_ONLY"]}'
 246 -     logs.…
 247 -     …
 248       save_name = save_name if save_name != '' else f'{str(datetime.datetime.now())[:-7]}_{original_text[:20]}'
 249       out = {
 250           'name': save_name,

   1   import datetime
   2   from io import StringIO
   3 + from typing import Union
   4   from random import sample
   5   from collections import defaultdict
   6   from streamlit.runtime.uploaded_file_manager import UploadedFile
   7 + from utilities_language_w2v.rus_sentence_w2v import SENTENCE, TASK
   8 + from utilities_language_general.rus_utils import compute_frequency_dict, prepare_target_words, prepare_tasks
   9 + from utilities_language_general.rus_constants import st, load_w2v, load_classifiers, nlp, summarization, BAD_USER_TARGET_WORDS, MINIMUM_SETS
  10
  11
  12 + def main_workflow(
  13 +         file: Union[UploadedFile, None],
  14           text: str,
  15 +         logs,
  16 +         progress,
  17 +         progress_d,
  18           level: str,
  19           tw_mode_automatic_mode: str,
  20           target_words: str,

  45       student_out, teacher_out, total_out, original_text
  46       """
  47       # Clear bad target_words each time
  48 +     if global_bad_target_words:
  49 +         global_bad_target_words = []
  50
  51       # Define main global variables
  52       GLOBAL_DISTRACTORS = set()
  53       MAX_FREQUENCY = 0
  54
  55       # Get input text
  56       if file is not None:
  57           stringio = StringIO(file.getvalue().decode("utf-8"))

  59       elif text != '':
  60           current_text = text
  61       else:
  62 +         st.warning('Вы и текст не вставили, и файл не выбрали 😢')
  63           current_text = ''
  64 +         st.stop()
  65
  66       # Process target words
  67       if tw_mode_automatic_mode == 'Самостоятельно':
  68           if target_words == '':
  69 +             st.warning('Вы не ввели целевые слова')
  70 +             st.stop()
  71           # Cannot make up paradigm, so only USER_TARGET_WORDS is used
  72           USER_TARGET_WORDS = prepare_target_words(target_words)
  73           tw_mode_automatic_mode = False

  80       current_text = current_text.replace('.', '. ').replace('. . .', '...').replace('  ', ' ').replace('…', '...') \
  81           .replace('…', '...').replace('—', '-').replace('\u2014', '-').replace('—', '-').replace('-\n', '') \
  82           .replace('\n', '%^&*')
  83 +     current_text_sentences = [sent.text.strip() for sent in nlp(current_text).sents]
  84 +     logs.update(label='Получили Ваш текст!', state='running')
  85       progress.progress(10)
  86
  87       # Compute frequency dict

  93           if j < len(FREQ_DICT) * _frequency_barrier_percent:
  94               MAX_FREQUENCY = tp[1]
  95       MAX_FREQUENCY = 3 if MAX_FREQUENCY < 3 else MAX_FREQUENCY
  96 +     logs.update(label="Посчитали немного статистики!", state='running')
  97       progress.progress(15)
  98
  99       # Choose necessary language minimum according to user's input
 100 +     if level:
 101 +         target_minimum, distractor_minimum = MINIMUM_SETS[level]
 102       else:
 103           target_minimum = None
 104           distractor_minimum = None
 105           logs.error('Вы не выбрали языковой уровень!')
 106           st.stop()
 107
 108 +     # Define which model is used for distractor generation
 109 +     logs.update(label='Загружаем языковые модели и другие данные', state='running')
 110 +     if model_name == 'Модель-1':
 111 +         mask_filler = load_w2v('model1')
 112 +         pos_dict, scaler, classifier = load_classifiers('model1')
 113 +     else:
 114 +         mask_filler = load_w2v('model2')
 115 +         pos_dict, scaler, classifier = load_classifiers('model1')
 116 +
 117       # Start generation process
 118       workflow = [SENTENCE(original=sent.strip(), n_sentence=num, max_num_distractors=num_distractors)
 119                   for num, sent in enumerate(current_text_sentences)]
 120 +     logs.update(label="Запускаем процесс генерации заданий!", state='running')
 121       progress.progress(20)
 122
 123 +     # Define summary length
 124 +     text_length = len(current_text_sentences)
 125 +     if text_length <= 15:
 126 +         summary_length = text_length
 127 +     elif text_length <= 25:
 128 +         summary_length = 15
 129 +     else:
 130 +         n = (text_length - 20) // 5
 131 +         summary_length = 15 + 2 * n
 132 +     round_summary_length = summary_length - (summary_length % - 10)
 133 +
 134 +     # Get summary. May choose between round_summary_length and summary_length
 135 +     SUMMARY = summarization(current_text, num_sentences=round_summary_length)
 136 +     logs.success('Нашли интересные предложения. Пригодятся!')
 137 +     progress.progress(25)
 138 +
 139       for sentence in workflow:
 140           sentence.lemmatize_sentence()
 141
 142       for sentence in workflow:
 143           sentence.bind_phrases()
 144 +     logs.update(label="Подготовили предложения для дальнейшей работы!", state='running')
 145       progress.progress(30)
 146
 147       for j, sentence in enumerate(workflow):

 149           target_words_automatic_mode=tw_mode_automatic_mode,
 150           target_minimum=target_minimum,
 151           user_target_words=USER_TARGET_WORDS,
 152 +         frequency_dict=FREQ_DICT,
 153 +         summary=SUMMARY)
 154       progress.progress(int(30 + (j * (30 / len(workflow)))))
 155 +     progress.progress(60)
 156       DUPLICATE_TARGET_WORDS = defaultdict(list)
 157       for sentence in workflow:
 158           for target_word in sentence.target_words:

 165           if target_word not in RESULT_TW:
 166               global_bad_target_words.append(target_word['original_text'])
 167               sentence.target_words.remove(target_word)
 168 +     progress.progress(65)
 169 +     logs.update(label='Выбрали слова-пропуски!', state='running')
 170
 171       for sentence in workflow:
 172           sentence.attach_distractors_to_target_word(model=mask_filler,
 173 +                                                    scaler=scaler,
 174 +                                                    classifier=classifier,
 175 +                                                    pos_dict=pos_dict,
 176                                                      global_distractors=GLOBAL_DISTRACTORS,
 177                                                      distractor_minimum=distractor_minimum,
 178                                                      level_name=level,
 179                                                      max_frequency=MAX_FREQUENCY,
 180 +                                                    logs=logs, progress=progress_d)
 181 +     progress.progress(70)
 182 +     logs.update(label='Подобрали неправильные варианты!', state='running')
 183       for sentence in workflow:
 184 +         sentence.inflect_distractors(level_name=level)
 185 +     progress.progress(80)
 186 +     logs.update(label='Просклоняли и проспрягали неправильные варианты!', state='running')
 187
 188       for sentence in workflow:
 189           sentence.filter_target_words(target_words_automatic_mode=tw_mode_automatic_mode)
 190
 191       for sentence in workflow:
 192           sentence.sample_distractors(num_distractors=num_distractors)
 193 +     progress.progress(90)
 194 +     logs.update(label='Отобрали лучшие задания!', state='running')
 195
 196       RESULT_TASKS = []
 197       for sentence in workflow:

 211           NUMBER_TASKS = 10
 212       else:
 213           NUMBER_TASKS = len(RESULT_TASKS)
 214 +     RESULT_TASKS_in_summary = list(filter(lambda task: task.in_summary, RESULT_TASKS))
 215 +     RESULT_TASTS_not_in_summary = list(filter(lambda task: not task.in_summary, RESULT_TASKS))
 216 +     if len(RESULT_TASKS_in_summary) >= NUMBER_TASKS:
 217 +         RESULT_TASKS = RESULT_TASKS_in_summary
 218 +     else:
 219 +         RESULT_TASKS = RESULT_TASKS_in_summary + sample(RESULT_TASTS_not_in_summary, NUMBER_TASKS - len(RESULT_TASKS_in_summary))
 220       RESULT_TASKS = sorted(RESULT_TASKS, key=lambda t: (t.sentence_number, t.position_in_sentence))
 221
 222       for task in RESULT_TASKS:

 240           f'{PREPARED_TASKS["KEYS_ONLY"]}'
 241       TOTAL_OUT = f'{original_text}\n\n{"$" * 70}\n\n{STUDENT_OUT}\n\n{"=" * 70}\n\n{PREPARED_TASKS["TASKS_TEACHER"]}' \
 242           f'\n\n{"$" * 70}\n\n{PREPARED_TASKS["KEYS_ONLY"]}'
 243 +     logs.update(label='Сейчас все будет готово!', state='running')
 244 +     progress.progress(90)
 245       save_name = save_name if save_name != '' else f'{str(datetime.datetime.now())[:-7]}_{original_text[:20]}'
 246       out = {
 247           'name': save_name,
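A quick check of the summary-length arithmetic added above: in Python, `x % -10` lies in (-10, 0], so `summary_length - (summary_length % -10)` rounds the length up to the nearest multiple of ten. For example (a sketch, not commit code):

    for text_length in (12, 20, 40):
        if text_length <= 15:
            summary_length = text_length
        elif text_length <= 25:
            summary_length = 15
        else:
            summary_length = 15 + 2 * ((text_length - 20) // 5)
        round_summary_length = summary_length - (summary_length % -10)
        print(text_length, summary_length, round_summary_length)
    # prints: 12 12 20 / 20 15 20 / 40 23 30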
utilities_language_w2v/rus_sentence_w2v.py
CHANGED
|
@@ -1,16 +1,7 @@
|
|
| 1 |
-
import copy
|
| 2 |
import string
|
| 3 |
-
from random import random
|
| 4 |
-
from
|
| 5 |
-
from utilities_language_general.
|
| 6 |
-
from utilities_language_general.rus_utils import get_tags
|
| 7 |
-
from utilities_language_general.rus_utils import check_token
|
| 8 |
-
from utilities_language_general.rus_constants import PHRASES
|
| 9 |
-
from utilities_language_general.rus_utils import define_gender
|
| 10 |
-
from utilities_language_general.rus_utils import convert_gender
|
| 11 |
-
from utilities_language_general.rus_utils import make_inflection
|
| 12 |
-
from utilities_language_general.rus_constants import BAD_USER_TARGET_WORDS
|
| 13 |
-
from utilities_language_general.rus_utils import get_distractors_from_model
|
| 14 |
|
| 15 |
|
| 16 |
class SENTENCE:
|
|
@@ -48,7 +39,7 @@ class SENTENCE:
             self.sentence_phrases.append(self.sentence_lemma_pos[i][1])
             previous_was_phrase = False

-    def search_target_words_automatically(self, model, target_minimum: set, frequency_dict: dict = None):
+    def search_target_words_automatically(self, model, target_minimum: set, frequency_dict: dict = None, summary: list = None):
         for token in self.sentence_phrases:
             # TODO: Still do not have w2v model with phrases
             # therefore cannot come up with the criteria
@@ -69,7 +60,8 @@ class SENTENCE:
                         'tags': tags,
                         'position_in_sentence': self.original.find(original_token1.text),
                         'not_named_entity': not_ner,
-                        'frequency_in_text': 0
+                        'frequency_in_text': 0,
+                        'in_summary': self.original in summary
                     }
                     self.target_words.append(target_word)
                 else:  # if token is just a spacy.nlp token
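The new `in_summary` flag is a plain membership test: `summary` is expected to be a list of original sentence strings, so a sentence is flagged when it survived extractive summarization verbatim. A tiny illustration, with a hard-coded `summary` standing in for a real summarizer's output:

    sentences = ['Мама мыла раму.', 'Папа читал газету.', 'Кошка спала.']
    summary = ['Мама мыла раму.', 'Кошка спала.']   # stand-in for summarizer output

    flags = {sentence: sentence in summary for sentence in sentences}
    print(flags)  # {'Мама мыла раму.': True, 'Папа читал газету.': False, 'Кошка спала.': True}

Because the check relies on exact string equality, any normalization applied to `self.original` but not to the summary sentences would silently yield False.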
@@ -86,10 +78,11 @@ class SENTENCE:
                         'position_in_sentence': self.original.find(token.text),
                         'not_named_entity': True if token.ent_type == 0 else False,
                         'frequency_in_text': frequency_dict.get(token.lemma_, 1),
+                        'in_summary': self.original in summary
                     }
                     self.target_words.append(target_word)

-    def search_user_target_words(self, model, user_target_words: set = None, frequency_dict: dict = None):
+    def search_user_target_words(self, model, user_target_words: set = None, frequency_dict: dict = None, summary: list = None):
         for _utw in user_target_words:
             if _utw in self.original:
                 parse_utw = nlp(_utw)
@@ -114,7 +107,8 @@ class SENTENCE:
                     'tags': user_target_word_tags,
                     'position_in_sentence': self.original.find(_utw),
                     'not_named_entity': not_ner,
-                    'frequency_in_text': frequency_dict.get(user_target_word_lemma, 1)
+                    'frequency_in_text': frequency_dict.get(user_target_word_lemma, 1),
+                    'in_summary': self.original in summary
                 }
                 if not (model.has_index_for(user_target_word_lemma)
                         or model.has_index_for(f'{user_target_word_lemma}_{user_target_word_pos[1]}')):
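The vocabulary check above accepts either a bare lemma or a RusVectōrēs-style `lemma_POS` key, since word2vec models for Russian are commonly keyed both ways. A sketch of the same check with gensim's `KeyedVectors`; the model file name is a placeholder, not the project's actual file:

    from gensim.models import KeyedVectors

    # Placeholder path: any word2vec binary with 'lemma_POS' keys will do.
    model = KeyedVectors.load_word2vec_format('model.bin', binary=True)

    def known_to_model(lemma: str, pos: str) -> bool:
        # Accept either a bare-lemma key or a POS-tagged key such as 'дом_NOUN'.
        return model.has_index_for(lemma) or model.has_index_for(f'{lemma}_{pos}')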
@@ -124,25 +118,24 @@ class SENTENCE:

     def search_target_words(self, model, target_words_automatic_mode: bool, target_minimum,
                             user_target_words: set = None,
-                            frequency_dict: dict = None):
+                            frequency_dict: dict = None, summary: list = None):
         if target_words_automatic_mode:
             self.search_target_words_automatically(model=model, target_minimum=target_minimum,
-                                                   frequency_dict=frequency_dict)
+                                                   frequency_dict=frequency_dict, summary=summary)
         else:
             self.search_user_target_words(model=model, user_target_words=user_target_words,
-                                          frequency_dict=frequency_dict)
+                                          frequency_dict=frequency_dict, summary=summary)

-    def attach_distractors_to_target_word(self, model,
-                                          max_frequency,
-                                          progress, logs):
+    def attach_distractors_to_target_word(self, model, scaler, classifier, pos_dict, global_distractors,
+                                          distractor_minimum, level_name, max_frequency, logs, progress):
         n_target_words = len(self.target_words)
         bad_target_words = []
         for i, target_word in enumerate(self.target_words):
             pos = target_word['pos'][0] if target_word['pos'][0] == 'phrase' else target_word['pos'][1]
-            distractors = get_distractors_from_model(model,
-
-
-                                                     distractor_minimum=distractor_minimum,
+            distractors = get_distractors_from_model(doc=self.parsed, model=model, scaler=scaler, classifier=classifier, pos_dict=pos_dict,
+                                                     target_text=target_word['original_text'], lemma=target_word['lemma'],
+                                                     pos=pos, gender=target_word['gender'], lemma_index=target_word['position_in_sentence'],
+                                                     global_distractors=global_distractors, distractor_minimum=distractor_minimum, level_name=level_name,
                                                      max_num_distractors=self.max_num_distractors)
             if distractors is None or target_word['frequency_in_text'] > max_frequency:
                 target_word['distractors'] = distractors
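The rewritten call above now hands the whole parsed sentence plus a scaler and classifier to `get_distractors_from_model`, so candidates can be ranked by more than raw embedding similarity. The core idea underneath is still nearest neighbours in the word2vec space, roughly as in this sketch; this is not the project's actual helper, and it skips the scaler/classifier re-ranking and the level filtering:

    def candidate_distractors(model, lemma: str, pos: str, top_n: int = 10):
        """Nearest w2v neighbours of the target, restricted to the same POS.
        `model` is a gensim KeyedVectors whose vocabulary uses 'lemma_POS' keys."""
        key = f'{lemma}_{pos}'
        if not model.has_index_for(key):
            return None
        neighbours = model.most_similar(key, topn=top_n * 3)   # over-fetch, then filter
        same_pos = [(word.rsplit('_', 1)[0], round(similarity, 3))
                    for word, similarity in neighbours if word.endswith(f'_{pos}')]
        return same_pos[:top_n]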
@@ -150,39 +143,29 @@ class SENTENCE:
                 target_word['distractors'] = distractors
                 target_word['distractors_number'] = len(distractors) if distractors is not None else 0
             progress.progress(i / n_target_words)
-            logs.
+            logs.update(label=f'Обработали {i}/{n_target_words} слов в {self.n_sentence + 1}-м предложении',
+                        state='running')
         for btw in bad_target_words:
             BAD_USER_TARGET_WORDS.append(btw['original_text'])
             self.target_words.remove(btw)
         progress.progress(100)
-        logs.
-
+        logs.update(label=f'Обработали {n_target_words}/{n_target_words} слов в {self.n_sentence + 1}-м предложении',
+                    state='running')

-    def inflect_distractors(self):
+    def inflect_distractors(self, level_name):
         bad_target_words = []
         for target_word in self.target_words:
             inflected_distractors = []
             for distractor_lemma, distractor_similarity in target_word['distractors']:
                 if distractor_lemma.count('_') > 1:
                     # TODO The same. Has to train model and test this code
-                    inflected = make_inflection(text=distractor_lemma,
+                    inflected = make_inflection(text=distractor_lemma, level=level_name,
                                                 pos=target_word['pos'][1], tags=target_word['tags'])
                 else:
-                    inflected = make_inflection(text=distractor_lemma,
+                    inflected = make_inflection(text=distractor_lemma, level=level_name,
                                                 pos=target_word['pos'][1], tags=target_word['tags'])
                 if inflected is not None:
                     inflected_distractors.append(inflected)
-                else:
-                    new_tags = copy.deepcopy(target_word['tags'])
-                    if 'NOUN' in target_word['tags'] and 'inan' in target_word['tags']:
-                        new_tags.discard('inan')
-                        new_tags.add('anim')
-                    elif 'NOUN' in target_word['tags'] and 'anim' in target_word['tags']:
-                        new_tags.discard('anim')
-                        new_tags.add('inan')
-                    inflected = make_inflection(text=distractor_lemma, pos=target_word['pos'][1], tags=new_tags)
-                    if inflected is not None:
-                        inflected_distractors.append(inflected)
             num_distractors = min(4, self.max_num_distractors) if self.max_num_distractors >= 4 \
                 else self.max_num_distractors
             if len(inflected_distractors) < num_distractors:
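`make_inflection` now also receives the learner level, presumably so the inflected form can be checked against the level's lexical minimum, while the `else:` branch deleted in this hunk used to retry with the opposite animacy grammeme ('anim' ↔ 'inan') when inflection failed. The inflection step itself is what pymorphy2 provides; a guess at the minimal core (the real helper lives in `utilities_language_general/rus_utils.py` and does more, including the level handling):

    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def inflect_lemma(lemma: str, tags: set):
        """Inflect a lemma into the form described by pymorphy2 grammemes,
        returning None when no such form exists."""
        parse = morph.parse(lemma)[0]              # most probable analysis
        inflected = parse.inflect(frozenset(tags))
        return inflected.word if inflected is not None else None

    print(inflect_lemma('стол', {'plur', 'gent'}))   # 'столов'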