Improvements for Gaelic spell checker

Projects like Google in Your Language, Wikipedia, Firefox, Thunderbird, Opera, LibreOffice/OpenOffice, phpBB, Scrabble and spell checkers
Pander
Posts: 3
Joined: Wed Jun 13, 2018 8:55 am
Language Level: non-existing
Corrections: I'm fine either way
Location: Netherlands

Post by Pander » Wed Jun 13, 2018 11:59 am

I would like to suggest some improvements to the Gaelic spell checker dictionary for Hunspell. Please consider the following:

a) Remove words from gd_GB.dic that incorrectly contain numerals:

Code:

ch1oirli-gheilc
https://cgit.freedesktop.org/libreoffic ... dic#n63825

b) Remove the language codes from gd_GB.dic. These are ISO codes rather than words, and gd_GB is the only entry containing an underscore (an underscore can trigger tokenisation while that word is being spell checked). For consistent support, either add all ISO codes or none at all. See https://salsa.debian.org/iso-codes-team/iso-codes for the upstream collection of ISO codes.

Code:

gd-GB
https://cgit.freedesktop.org/libreoffic ... ic#n328040

Code:

gd_GB
https://cgit.freedesktop.org/libreoffic ... ic#n328038

c) Add the character ⁊ (U+204A, Tironian sign et; https://en.wikipedia.org/wiki/%E2%81%8A ) to WORDCHARS and possibly to TRY in gd_GB.aff because, like the underscore, it can cause words containing it to be split into separate tokens during spell checking.

Code:

L⁊L
https://cgit.freedesktop.org/libreoffic ... dic#n10290

Code:

⁊c
https://cgit.freedesktop.org/libreoffic ... ic#n325951
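
As a rough illustration of item c), the Python sketch below reports which characters are still missing from the TRY and WORDCHARS lines of an .aff file. The function name missing_chars and the sample .aff excerpt are invented for this example; the actual lines in gd_GB.aff will differ.

```python
def missing_chars(aff_text: str, wanted: str) -> dict:
    """Return, per directive, the wanted characters that are not listed yet."""
    present = {"TRY": "", "WORDCHARS": ""}
    for line in aff_text.splitlines():
        # A directive line looks like "WORDCHARS <characters>".
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in present:
            present[parts[0]] += parts[1].strip()
    return {name: set(wanted) - set(chars) for name, chars in present.items()}


# Hypothetical excerpt, NOT the real gd_GB.aff content.
sample_aff = "SET UTF-8\nTRY aeiou-\nWORDCHARS -'0123456789\n"
result = missing_chars(sample_aff, "⁊0123456789")
print(sorted(result["WORDCHARS"]))  # ['⁊']
print(len(result["TRY"]))           # 11
```

Running the same function on the text of the real .aff file would show whether ⁊ and the digits still need to be added.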

When these issues have been resolved, please submit a new version of the Gaelic spell checker to LibreOffice via https://bugs.documentfoundation.org/bug ... lution=---

PS Another issue I found: searching for ⁊ as "Part of word" at http://www.faclair.com/index.aspx?Language=en results in the following error:

Code:

Server Error in '/' Application.
Got error 'repetition-operator operand invalid' from regexp
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: MySql.Data.MySqlClient.MySqlException: Got error 'repetition-operator operand invalid' from regexp
...



Post by Pander » Wed Jun 13, 2018 12:15 pm

d) Consider adding 0123456789 to WORDCHARS and TRY in gd_GB.aff in order to better support dictionary words containing numerals. Currently these types of words in gd_GB.dic are: 1810an, 1830an, 1840an, 1850an, 1860an, 1870an, 1880an, 1890an, 1910an, 1920an, 1930an, 1940an, 1950an, 1960an, 1970an, 1980an, 1990an, 20an, 2D, 30an, 3D, 40an, 50an, 60an, 70an, 80an, 90an, Channel4, ITV1, ITV2, ITV3, ITV4, MI5, TG4. Fixing this will improve suggestions and lessen the risk of word tokenisation during spell checking.
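
A list like the one above can be produced with a few lines of Python. This is only a sketch: the function name entries_with_digits and the sample data (including the /M flag) are made up, and it assumes the usual .dic layout of an entry count on the first line followed by one word per line with optional /FLAGS.

```python
def entries_with_digits(dic_text: str) -> list:
    """Return the dictionary words that contain at least one digit."""
    lines = dic_text.splitlines()
    # The first line of a Hunspell .dic file is normally the entry count.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    found = []
    for line in lines:
        # Drop affix flags ("word/FLAGS") and any tab-separated extra fields.
        word = line.split("/", 1)[0].split("\t", 1)[0].strip()
        if any(ch.isdigit() for ch in word):
            found.append(word)
    return found


# Hypothetical sample, not an excerpt of the real gd_GB.dic.
sample_dic = "5\n1990an\ncoirce\nITV1/M\nch1oirli-gheilc\n3D\n"
print(entries_with_digits(sample_dic))
# ['1990an', 'ITV1', 'ch1oirli-gheilc', '3D']
```

The output still needs a human check, since it mixes intended entries such as 1990an with typos such as ch1oirli-gheilc.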

akerbeltz
Rianaire
Posts: 1699
Joined: Mon Nov 17, 2008 2:26 am
Language Level: Barail am broinn baraille
Corrections: Please don't analyse my Gaelic
Location: Glaschu
Post by akerbeltz » Wed Jun 13, 2018 1:19 pm

Hi Pander (are you user pander on SourceForge, by any chance?)

ch1oirli-gheilc

That was a typo. It is fixed at source and will be included in the next update.

gd-GB/gd_GB

Removed. I wasn't aware that _ triggers tokenisation.

⁊ 123...

I have added these to TRY and WORDCHARS, though I'm not entirely sure of the effects. We will see during the next build.

I'm aware of the search bug in the Faclair Beag (there are others, but our main dev is scaling mountains in Colorado just now), but thanks for reporting!

A big-ish overhaul is about to start, as we're also going to tackle hyphenation and possibly a thesaurus, so the fixes will appear in time but not immediately, as I usually dogfood new versions for a few weeks before letting them loose on Gaelking 8-)

Thanks very much for all your great input!

Post by Pander » Wed Jun 13, 2018 4:38 pm

Yes, that Pander. :-)

e) Not sure if this equals sign belongs in the following word in the dictionary:

Code:

coirce=fiadhain
https://cgit.freedesktop.org/libreoffic ... dic#n99525

f) The dictionary has 5 words with U+2011, a non-breaking hyphen. For hyphenation of words, other facilities exist. Judging by similar words in the dictionary, these characters are supposed to be normal hyphens (U+002D):

Code:

capsailean‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n60574

Code:

capsaile‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n60575

Code:

chapsailean‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n66636

Code:

chapsaile‑fànais
https://cgit.freedesktop.org/libreoffic ... dic#n66637

Code:

àiteachais‑àrainneachd
https://cgit.freedesktop.org/libreoffic ... ic#n321591

g) Before releasing a new version, make a histogram of all the characters used in the .dic and .aff files to double-check that no unwanted characters were included. If you want, I have a practical Python script for that.
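
For readers who want to try the histogram idea from item g), here is a minimal sketch. It is not the script mentioned above; the function names char_histogram and report and the sample text are assumptions made for illustration. Characters are listed rarest first, since stray characters such as the equals sign in e) or U+2011 in f) tend to occur only a handful of times.

```python
import unicodedata
from collections import Counter


def char_histogram(text: str) -> Counter:
    """Count every character except line breaks."""
    return Counter(ch for ch in text if ch not in "\r\n")


def report(hist: Counter) -> list:
    """Format one line per character, rarest first."""
    lines = []
    for ch, count in sorted(hist.items(), key=lambda item: (item[1], item[0])):
        # Control characters have no Unicode name, hence the fallback.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}  {count:>6}  {name}")
    return lines


# Hypothetical sample built from words quoted in this thread.
sample = "capsaile\u2011fànais\ncoirce=fiadhain\nL⁊L\n"
hist = char_histogram(sample)
for line in report(hist):
    print(line)
```

To check real files, the sample string could be replaced with the contents of gd_GB.dic and gd_GB.aff, read with whatever encoding the SET line of the .aff file declares.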

Post by akerbeltz » Wed Jun 13, 2018 4:51 pm

:naire: all fixed, many thanks!

Yes, that would be useful to have. I'll PM you my email.