Improvements for Gaelic spell checker

Posted: Wed Jun 13, 2018 11:59 am
by Pander
I would like to report some improvements for the Gaelic spell checker support for Hunspell. Please consider the following:

a) Remove words from gd_GB.dic that incorrectly contain numerals:

Code:

ch1oirli-gheilc
https://cgit.freedesktop.org/libreoffic ... dic#n63825

b) Remove the language codes from gd_GB.dic. These are ISO codes, and gd_GB is the only word in the dictionary that contains an underscore (an underscore can trigger tokenisation during the spell checking of that word). For full support, either add all ISO codes or add none at all. See https://salsa.debian.org/iso-codes-team/iso-codes for the upstream collection of ISO codes.

Code:

gd-GB
https://cgit.freedesktop.org/libreoffic ... ic#n328040

Code:

gd_GB
https://cgit.freedesktop.org/libreoffic ... ic#n328038

c) Add the character ⁊ ( https://en.wikipedia.org/wiki/%E2%81%8A ) to WORDCHARS and possibly to TRY in gd_GB.aff because, like the underscore, this character can cause a word to be tokenised during spell checking.

Code:

L⁊L
https://cgit.freedesktop.org/libreoffic ... dic#n10290

Code:

⁊c
https://cgit.freedesktop.org/libreoffic ... ic#n325951

When these issues have been resolved, please offer a new version for Gaelic spell checking at LibreOffice via https://bugs.documentfoundation.org/bug ... lution=---

PS Another issue I have found: searching for ⁊ as "Part of word" in http://www.faclair.com/index.aspx?Language=en results in the following error:

Code:

Server Error in '/' Application.
Got error 'repetition-operator operand invalid' from regexp
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: MySql.Data.MySqlClient.MySqlException: Got error 'repetition-operator operand invalid' from regexp
...

Posted: Wed Jun 13, 2018 12:15 pm
by Pander
d) Consider adding 0123456789 to WORDCHARS and TRY in gd_GB.aff in order to better support dictionary words containing numerals. Currently these types of words in gd_GB.dic are: 1810an, 1830an, 1840an, 1850an, 1860an, 1870an, 1880an, 1890an, 1910an, 1920an, 1930an, 1940an, 1950an, 1960an, 1970an, 1980an, 1990an, 20an, 2D, 30an, 3D, 40an, 50an, 60an, 70an, 80an, 90an, Channel4, ITV1, ITV2, ITV3, ITV4, MI5, TG4. Fixing this will improve suggestions and lessen the risk of word tokenisation during spell checking.
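
A list like the one above could be regenerated with a short script. The following is only a minimal sketch: the function name words_with_digits is made up for illustration, and it assumes the usual .dic layout (entry count on the first line, one word per line, optional affix flags after a slash). It does not handle escaped slashes or morphological fields.

```python
ASCII_DIGITS = set("0123456789")


def words_with_digits(dic_lines):
    """Return the dictionary words that contain at least one ASCII digit.

    Assumes the first line holds the entry count and is not a word.
    """
    found = []
    for line in dic_lines[1:]:
        # Keep only the word itself; affix flags follow the first slash.
        word = line.strip().split("/", 1)[0]
        if word and any(ch in ASCII_DIGITS for ch in word):
            found.append(word)
    return found


# Tiny made-up sample in .dic layout; the real input would be gd_GB.dic.
sample = ["5", "1810an", "ch1oirli-gheilc", "cat/AB", "MI5", "taigh"]
print(words_with_digits(sample))
```

Such a list also catches typos such as ch1oirli-gheilc, since they show up next to the intended entries like 1810an or MI5.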

Posted: Wed Jun 13, 2018 1:19 pm
by akerbeltz
Hi Pander (are you user pander on sourceforge by any chance?)
ch1oirli-gheilc
That was a typo, fixed at source, will be fixed during next update.
gd-GB/gd_GB
Removed, wasn't aware that _ triggers tokenisation
⁊ 123...
Have added these to TRY and WORDCHARS though not entirely sure of effects - will see during next build.
Aware of the search bug (there are others but our main dev is scaling mountains in Colorado just now) in the Faclair Beag but thanks for reporting!

A big-ish overhaul is about to start, as we're also going to tackle hyphenation and possibly a thesaurus, so the fixes will appear in time but not immediately as I usually dogfood new versions for a few weeks before letting them loose on Gaelking 8-)

Thanks very much for all your great input!

Posted: Wed Jun 13, 2018 4:38 pm
by Pander
Yes, that Pander. :-)

e) Not sure if this equals sign belongs in the following word in the dictionary:

Code:

coirce=fiadhain
https://cgit.freedesktop.org/libreoffic ... dic#n99525

f) The dictionary has 5 words with U+2011, a non-breaking hyphen. For hyphenation of words, other facilities exist. Looking at similar words in the dictionary, these characters are supposed to be normal hyphens:

Code:

capsailean‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n60574

Code:

capsaile‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n60575

Code:

chapsailean‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n66636

Code:

chapsaile‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n66637

Code:

àiteachais‑àrainneachd
https://cgit.freedesktop.org/libreoffic ... ic#n321591
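
The five entries in f) could be found and repaired mechanically. This is a rough sketch only: normalise_hyphens is a hypothetical helper name, and it assumes that every U+2011 in the word list should become a plain hyphen-minus (U+002D), which matches the entries listed here but should be checked before applying it to the whole file.

```python
NB_HYPHEN = "\u2011"  # NON-BREAKING HYPHEN


def normalise_hyphens(words):
    """Replace U+2011 with '-' and report which entries were affected.

    Returns a tuple (fixed_words, changed), where changed holds the
    original entries that contained a non-breaking hyphen.
    """
    changed = [w for w in words if NB_HYPHEN in w]
    fixed = [w.replace(NB_HYPHEN, "-") for w in words]
    return fixed, changed


# Two entries from the list above plus one unaffected made-up entry.
entries = ["capsaile\u2011fànais", "àiteachais\u2011àrainneachd", "taigh-beag"]
fixed, changed = normalise_hyphens(entries)
print(len(changed), fixed)
```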

g) Before releasing a new version, make a histogram of all the characters used in the .dic and .aff files to double check that no unwanted characters were included. If you want, I have a practical Python script for that.
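
For illustration, a character histogram of this kind can be put together in a few lines. This is not the script mentioned above, just a minimal sketch with made-up function names (char_histogram, report). It lists the rarest characters first, on the assumption that unwanted characters tend to occur only a handful of times.

```python
import unicodedata
from collections import Counter


def char_histogram(text):
    """Count every character in text, ignoring line breaks."""
    counts = Counter(text)
    counts.pop("\n", None)
    return counts


def report(counts):
    """Format the counts as 'U+XXXX<TAB>count<TAB>name' lines, rarest first."""
    lines = []
    for ch, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"U+{ord(ch):04X}\t{n}\t{name}")
    return lines


# Small made-up sample; the real input would be the .dic and .aff contents.
sample = "coirce=fiadhain\ncapsaile\u2011fànais\n"
for line in report(char_histogram(sample)):
    print(line)
```

To run it on the real files, read them with the encoding declared by SET in the .aff file (assumed here to be UTF-8), for example open("gd_GB.dic", encoding="utf-8").read(), and pass the text to char_histogram.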

Posted: Wed Jun 13, 2018 4:51 pm
by akerbeltz
:naire: all fixed, many thanks!

Yes that would be useful to have, I'll pm you my email