To provide an unbiased comparison of the accuracy of our Language Identification API, we tested it against the Europarl test set: 21 languages with 1000 texts each, randomly sampled from the Europarl corpus. This is the same test set that Mike McCandless used to compare the accuracy of CLD (Compact Language Detector), Tika and language-detection. For completeness, we also included langid.py in our comparison.
Tika does not support Bulgarian (bg), Czech (cs), Lithuanian (lt) or Latvian (lv), so our comparison covers only the remaining 17 languages that all of the language detectors support:
Code | Language | Code | Language | Code | Language |
---|---|---|---|---|---|
da | Danish | de | German | el | Greek |
en | English | es | Spanish | et | Estonian |
fi | Finnish | fr | French | hu | Hungarian |
it | Italian | nl | Dutch | pl | Polish |
pt | Portuguese | ro | Romanian | sk | Slovak |
sl | Slovene | sv | Swedish | | |
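The restriction to 17 languages is a plain set difference. The sketch below is only an illustration; the names `EUROPARL`, `MISSING_IN_TIKA` and `common_languages` are made up for this example and do not come from any of the tested libraries.

```python
# Language codes in the 21-language Europarl test set described in this post.
EUROPARL = {
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
}
# Languages that Tika does not support, according to this post.
MISSING_IN_TIKA = {"bg", "cs", "lt", "lv"}


def common_languages(all_langs, unsupported):
    """Return the sorted codes that remain after dropping unsupported ones."""
    return sorted(set(all_langs) - set(unsupported))


COMMON = common_languages(EUROPARL, MISSING_IN_TIKA)
print(len(COMMON), COMMON)
```

Texts whose true language is not in `COMMON` would then simply be skipped before any detector is run.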
This test is challenging, as many of the texts are very short: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.
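Length statistics of this kind can be computed in a few lines. This is a minimal sketch that assumes the texts are measured as UTF-8 bytes (the encoding is an assumption, not something stated here); `short_text_stats` is a hypothetical helper name.

```python
def short_text_stats(texts, limit=30):
    """Return (shortest length in bytes, count of texts <= limit bytes, their share).

    Lengths are measured on the UTF-8 encoding, which is an assumption.
    """
    sizes = [len(t.encode("utf-8")) for t in texts]
    short = sum(1 for n in sizes if n <= limit)
    return min(sizes), short, short / len(sizes)


# Share quoted in the text: 290 of 17000 texts.
print(f"{290 / 17000:.1%}")
```

Measuring bytes rather than characters matters for non-Latin scripts: a Greek text takes roughly two bytes per letter in UTF-8, so a 30-byte Greek text carries far fewer characters than a 30-byte English one.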
Besides the difficulty of the corpus itself, the detectors differ in how many languages they can identify. WhatLanguage.net detects 110 languages, langid.py detects 97, CLD detects at least 76, language-detection detects 53 and Tika detects 27. This slightly biases the test against our WhatLanguage.net API and langid.py, since classification becomes harder as the number of candidate languages grows.
The results of CLD, Tika and language-detection are taken from the comparison test of Mike McCandless. We used version 1.1.3 of langid.py.
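A comparison like this boils down to a small evaluation loop: run a detector on every text and count, per true language, which code it returned. The sketch below is not the script used for this test; `evaluate`, `format_row` and the stand-in detector `stub` are hypothetical names, and the ordering of the error cells is an arbitrary choice.

```python
from collections import Counter, defaultdict


def evaluate(samples, detect):
    """Count detected codes per true language.

    samples: iterable of (true_lang, text) pairs
    detect:  callable mapping a text to a language code
    """
    confusion = defaultdict(Counter)
    for true_lang, text in samples:
        confusion[true_lang][detect(text)] += 1
    return confusion


def format_row(lang, counts):
    """Format one row: language, accuracy, then counts (correct label first)."""
    total = sum(counts.values())
    # Correct label first, then errors by descending count, ties by code.
    ordered = sorted(counts.items(), key=lambda kv: (kv[0] != lang, -kv[1], kv[0]))
    cells = [lang, f"{100 * counts[lang] / total:.1f}%"]
    cells += [f"{code}={n}" for code, n in ordered]
    return " | ".join(cells)


# Stand-in data and detector, used only to show the output format.
demo = [("da", "x")] * 985 + [("da", "y")] * 15
stub = lambda text: "da" if text == "x" else "nb"
print(format_row("da", evaluate(demo, stub)["da"]))  # da | 98.5% | da=985 | nb=15
```

With langid.py installed, `detect` could be something like `lambda t: langid.classify(t)[0]`, since `langid.classify` returns a (language, score) pair.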
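Below are the test results. Each row lists the language of the test texts, the share of the 1000 texts identified correctly, and the number of texts assigned to each detected language: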
WhatLanguage.net API results (total 99.79% = 16965 / 17000):
da | 98.5% | da=985 | nb=15 | |||
de | 99.9% | de=999 | sv=1 | |||
el | 100.0% | el=1000 | ||||
en | 100.0% | en=1000 | ||||
es | 99.5% | es=995 | gl=3 | af=1 | ia=1 | |
et | 100.0% | et=1000 | ||||
fi | 99.9% | fi=999 | et=1 | |||
fr | 99.6% | fr=996 | lb=1 | mg=1 | pt=1 | vo=1 |
hu | 100.0% | hu=1000 | ||||
it | 99.9% | it=999 | bg=1 | |||
nl | 99.7% | nl=997 | li=2 | fy=1 | ||
pl | 100.0% | pl=1000 | ||||
pt | 99.8% | pt=998 | gl=2 | |||
ro | 100.0% | ro=1000 | ||||
sk | 99.7% | sk=997 | cs=3 | |||
sl | 100.0% | sl=1000 | ||||
sv | 100.0% | sv=1000 |
For completeness, here are our API's results for the four languages excluded from this test, Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv):
bg | 99.8% | bg=998 | uk=1 | mk=1 |
cs | 100.0% | cs=1000 | ||
lt | 100.0% | lt=1000 | ||
lv | 99.9% | lv=999 | pt=1 |
langid.py results (total 99.16% = 16858 / 17000):
da | 95.4% | da=954 | no=29 | nb=14 | nl=2 | pt=1 | |||
de | 99.8% | de=998 | en=1 | fi=1 | |||||
el | 100.0% | el=1000 | |||||||
en | 99.9% | en=999 | es=1 | ||||||
es | 99.4% | es=994 | gl=5 | it=1 | |||||
et | 98.6% | et=986 | fi=7 | it=2 | en=1 | id=1 | pt=1 | se=1 | sv=1 |
fi | 99.7% | fi=997 | da=1 | de=1 | no=1 | ||||
fr | 99.5% | fr=995 | es=2 | ku=1 | lb=1 | oc=1 | |||
hu | 99.9% | hu=999 | sv=1 | ||||||
it | 99.8% | it=998 | de=1 | eu=1 | |||||
nl | 99.9% | nl=999 | af=1 | ||||||
pl | 99.9% | pl=999 | en=1 | ||||||
pt | 99.3% | pt=993 | gl=5 | es=1 | oc=1 | ||||
ro | 99.9% | ro=999 | fr=1 | ||||||
sk | 98.4% | sk=984 | cs=11 | sl=2 | ga=1 | lt=1 | pt=1 | ||
sl | 97.3% | sl=973 | hr=17 | bs=5 | it=2 | hu=1 | lt=1 | sv=1 | |
sv | 99.1% | sv=991 | no=8 | nb=1 |
CLD results (total 98.82% = 16800 / 17000):
da | 93.4% | da=934 | nb=54 | sv=5 | fr=2 | eu=2 | is=1 | hr=1 | en=1 | |
de | 99.6% | de=996 | en=2 | ga=1 | cy=1 | |||||
el | 100.0% | el=1000 | ||||||||
en | 100.0% | en=1000 | ||||||||
es | 98.3% | es=983 | pt=4 | gl=3 | en=3 | it=2 | eu=2 | id=1 | fi=1 | da=1 |
et | 99.6% | et=996 | ro=1 | id=1 | fi=1 | en=1 | ||||
fi | 100.0% | fi=1000 | ||||||||
fr | 99.2% | fr=992 | en=4 | sq=2 | de=1 | ca=1 | ||||
hu | 99.9% | hu=999 | it=1 | |||||||
it | 99.5% | it=995 | ro=1 | mt=1 | id=1 | fr=1 | eu=1 | |||
nl | 99.5% | nl=995 | af=3 | sv=1 | et=1 | |||||
pl | 99.6% | pl=996 | tr=1 | sw=1 | nb=1 | en=1 | ||||
pt | 98.7% | pt=987 | gl=4 | es=3 | mt=1 | it=1 | is=1 | ht=1 | fi=1 | en=1 |
ro | 99.8% | ro=998 | da=1 | ca=1 | ||||||
sk | 98.8% | sk=988 | cs=9 | en=2 | de=1 | |||||
sl | 95.1% | sl=951 | hr=32 | sr=8 | sk=5 | en=2 | id=1 | cs=1 | ||
sv | 99.0% | sv=990 | nb=9 | en=1 |
Tika results (total 97.12% = 16510 / 17000):
da | 87.6% | da=876 | no=112 | nl=4 | sv=3 | it=1 | fr=1 | et=1 | en=1 | de=1 | ||||
de | 98.5% | de=985 | nl=3 | it=3 | da=3 | sv=2 | fr=2 | sl=1 | ca=1 | |||||
el | 100.0% | el=1000 | ||||||||||||
en | 96.9% | en=969 | no=10 | it=6 | ro=4 | sk=3 | fr=3 | hu=2 | et=2 | sv=1 | ||||
es | 89.8% | es=898 | gl=47 | pt=22 | ca=15 | it=6 | eo=4 | fr=3 | fi=2 | sk=1 | nl=1 | et=1 | ||
et | 99.1% | et=991 | fi=4 | fr=2 | sl=1 | no=1 | ca=1 | |||||||
fi | 99.4% | fi=994 | et=5 | hu=1 | ||||||||||
fr | 98.0% | fr=980 | sl=6 | eo=3 | et=2 | sk=1 | ro=1 | no=1 | it=1 | gl=1 | fi=1 | es=1 | de=1 | ca=1 |
hu | 99.9% | hu=999 | ca=1 | |||||||||||
it | 99.4% | it=994 | eo=4 | pt=1 | fr=1 | |||||||||
nl | 97.8% | nl=978 | no=8 | de=3 | da=3 | sl=2 | ro=2 | pl=1 | it=1 | gl=1 | et=1 | |||
pl | 99.1% | pl=991 | sl=3 | sk=2 | ro=1 | it=1 | hu=1 | fi=1 | ||||||
pt | 94.4% | pt=944 | gl=48 | hu=2 | ca=2 | it=1 | et=1 | es=1 | en=1 | |||||
ro | 99.3% | ro=993 | is=2 | sl=1 | pl=1 | it=1 | hu=1 | fr=1 | ||||||
sk | 96.2% | sk=962 | sl=21 | pl=13 | it=2 | ro=1 | et=1 | |||||||
sl | 98.5% | sl=985 | sk=7 | et=4 | it=2 | pt=1 | no=1 | |||||||
sv | 97.1% | sv=971 | no=15 | nl=6 | da=6 | de=1 | ca=1 |
Language-detection results (total 99.22% = 16868 / 17000):
da | 97.1% | da=971 | no=28 | en=1 | |||
de | 99.8% | de=998 | da=1 | af=1 | |||
el | 100.0% | el=1000 | |||||
en | 99.7% | en=997 | nl=1 | fr=1 | af=1 | ||
es | 99.5% | es=995 | pt=4 | en=1 | |||
et | 99.6% | et=996 | fi=2 | de=1 | af=1 | ||
fi | 99.8% | fi=998 | et=2 | ||||
fr | 99.8% | fr=998 | sv=1 | it=1 | |||
hu | 99.9% | hu=999 | id=1 | ||||
it | 99.8% | it=998 | es=2 | ||||
nl | 97.7% | nl=977 | af=21 | sv=1 | de=1 | ||
pl | 99.9% | pl=999 | nl=1 | ||||
pt | 99.4% | pt=994 | es=3 | it=1 | hu=1 | en=1 | |
ro | 99.9% | ro=999 | fr=1 | ||||
sk | 98.7% | sk=987 | cs=8 | sl=2 | ro=1 | lt=1 | et=1 |
sl | 97.2% | sl=972 | hr=27 | en=1 | |||
sv | 99.0% | sv=990 | no=8 | da=2 |
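The five totals can be put side by side with a few lines of arithmetic. The correct counts below are copied from the totals reported for each detector; the names `CORRECT`, `TOTAL` and `ranking` are made up for this sketch.

```python
# Correctly identified texts out of 17000, as reported for each detector.
CORRECT = {
    "WhatLanguage.net API": 16965,
    "language-detection": 16868,
    "langid.py": 16858,
    "CLD": 16800,
    "Tika": 16510,
}
TOTAL = 17000


def ranking(correct, total):
    """Return (name, accuracy in %, number of errors), most accurate first."""
    rows = [(name, 100 * n / total, total - n) for name, n in correct.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)


for name, acc, errors in ranking(CORRECT, TOTAL):
    print(f"{name:22s} {acc:6.2f}%  {errors:3d} errors")
```

Looking at the error counts rather than the percentages makes the gaps easier to see: 35 misclassified texts for the WhatLanguage.net API against 132 for language-detection, 142 for langid.py, 200 for CLD and 490 for Tika.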
Conclusion
Our Language Identification API has the best accuracy in this test (99.79%), closely followed by the language-detection (99.22%) and langid.py (99.16%) libraries. If you need accurate language identification in your application, you can test our API without any risk and see for yourself. Besides being very accurate, our API web service detects 110 languages and can be called easily from any programming language, since it returns its output as JSON or XML.