Sign up now and receive 2000 free API calls or 1 million free characters.
This allows you to test and evaluate our language identification web service without any risk.

To provide an unbiased accuracy comparison of our Language Identification API, we tested it against a test set drawn from the Europarl corpus. The test set covers 21 languages with 1000 texts each, randomly sampled from Europarl. It is the same test set that Mike McCandless used to compare the accuracy of CLD (Compact Language Detector), Tika and language-detection. For completeness, we also included langid.py in our comparison.

Tika does not support Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so our comparison covers only the remaining 17 languages, which all of the language detectors support:

     da  Danish       de  German      el  Greek
     en  English      es  Spanish     et  Estonian
     fi  Finnish      fr  French      hu  Hungarian
     it  Italian      nl  Dutch       pl  Polish
     pt  Portuguese   ro  Romanian    sk  Slovak
     sl  Slovene      sv  Swedish
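
The selection of the 17 common languages can be sketched as a set difference. This is only an illustration: the variable names are made up for this example, and the 21 codes are the 17 listed above plus the four that Tika lacks.

```python
# The 21 language codes in the test set: the 17 listed above
# plus the four that Tika does not support.
EUROPARL_LANGS = {
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
}

# Languages missing from Tika, as stated in the text.
TIKA_MISSING = {"bg", "cs", "lt", "lv"}

# Keep only the languages that every detector in the comparison supports.
common_langs = sorted(EUROPARL_LANGS - TIKA_MISSING)

print(len(common_langs), common_langs)
```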

This test is challenging, as many of the texts are very short: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.
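
As a rough sketch of how such a short-text statistic could be computed, the helper below counts the texts at or below a byte threshold. The function name and the assumption that lengths are measured on UTF-8 encoded text are ours for illustration, and the sample texts are made up rather than taken from the test set.

```python
def short_text_share(texts, max_bytes=30):
    """Return (count, fraction) of texts whose encoded size is <= max_bytes.

    Assumes lengths are measured in bytes of the UTF-8 encoding.
    """
    short = sum(1 for t in texts if len(t.encode("utf-8")) <= max_bytes)
    return short, short / len(texts)


# Toy sample (hypothetical texts, not from the test set).
sample = ["Hallo", "x" * 40]
count, share = short_text_share(sample)
print(count, share)

# The share reported in the text: 290 short texts out of 17000.
reported_share = 290 / 17000
print(f"{reported_share:.1%}")
```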

Beyond the difficulty of the corpus itself, the detectors also differ in the number of languages they can identify. WhatLanguage.net detects 110 languages, langid.py detects 97, CLD detects at least 76, language-detection detects 53 and Tika detects 27. This biases the test slightly against our WhatLanguage.net API and langid.py, since the classification task becomes harder as the number of supported languages grows.

The results of CLD, Tika and language-detection are taken from the comparison test of Mike McCandless. We used version 1.1.3 of langid.py.


Below are the test results:

WhatLanguage.net API results (total 99.79% = 16965 / 17000):

     da  98.5%   da=985  nb=15      
     de  99.9%   de=999  sv=1      
     el  100.0%   el=1000        
     en  100.0%   en=1000        
     es  99.5%   es=995  gl=3  af=1  ia=1  
     et  100.0%   et=1000        
     fi  99.9%   fi=999  et=1      
     fr  99.6%   fr=996  lb=1  mg=1  pt=1  vo=1
     hu  100.0%   hu=1000        
     it  99.9%   it=999  bg=1      
     nl  99.7%   nl=997  li=2  fy=1    
     pl  100.0%   pl=1000        
     pt  99.8%   pt=998  gl=2      
     ro  100.0%   ro=1000        
     sk  99.7%   sk=997  cs=3      
     sl  100.0%   sl=1000        
     sv  100.0%   sv=1000        
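
Each row above lists the expected language, its accuracy, and the number of texts assigned to each detected language. A minimal sketch of how such rows add up to an overall accuracy follows. The helper names are hypothetical, and only three of the rows are used to keep the example short.

```python
def parse_row(line):
    """Parse a row like 'da  98.5%  da=985  nb=15'.

    Returns (expected_language, {detected_language: count}).
    """
    fields = line.split()
    expected = fields[0]
    counts = {}
    for field in fields[2:]:  # fields[1] is the printed percentage
        lang, n = field.split("=")
        counts[lang] = int(n)
    return expected, counts


def accuracy(rows):
    """Return (correct, total) summed over all rows."""
    correct = total = 0
    for line in rows:
        expected, counts = parse_row(line)
        correct += counts.get(expected, 0)
        total += sum(counts.values())
    return correct, total


# Three rows copied from the table above.
rows = [
    "da  98.5%   da=985  nb=15",
    "es  99.5%   es=995  gl=3  af=1  ia=1",
    "fr  99.6%   fr=996  lb=1  mg=1  pt=1  vo=1",
]

correct, total = accuracy(rows)
print(f"{correct} / {total} = {correct / total:.2%}")  # 2976 / 3000 = 99.20%
```

Applying the same sum to all 17 rows gives the total shown in the heading of each table.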


For completeness, here are the results of our API for the four languages excluded from this test, Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv):

     bg  99.8%   bg=998  uk=1  mk=1
     cs  100.0%   cs=1000    
     lt  100.0%   lt=1000    
     lv  99.9%   lv=999  pt=1  


langid.py results (total 99.16% = 16858 / 17000):

     da  95.4%   da=954  no=29  nb=14  nl=2  pt=1      
     de  99.8%   de=998  en=1  fi=1          
     el  100.0%   el=1000              
     en  99.9%   en=999  es=1            
     es  99.4%   es=994  gl=5  it=1          
     et  98.6%   et=986  fi=7  it=2  en=1  id=1  pt=1  se=1  sv=1
     fi  99.7%   fi=997  da=1  de=1  no=1        
     fr  99.5%   fr=995  es=2  ku=1  lb=1  oc=1      
     hu  99.9%   hu=999  sv=1            
     it  99.8%   it=998  de=1  eu=1          
     nl  99.9%   nl=999  af=1            
     pl  99.9%   pl=999  en=1            
     pt  99.3%   pt=993  gl=5  es=1  oc=1        
     ro  99.9%   ro=999  fr=1            
     sk  98.4%   sk=984  cs=11  sl=2  ga=1  lt=1  pt=1    
     sl  97.3%   sl=973  hr=17  bs=5  it=2  hu=1  lt=1  sv=1  
     sv  99.1%   sv=991  no=8  nb=1          


CLD results (total 98.82% = 16800 / 17000):

     da  93.4%   da=934  nb=54  sv=5  fr=2  eu=2  is=1  hr=1  en=1  
     de  99.6%   de=996  en=2  ga=1  cy=1          
     el  100.0%   el=1000                
     en  100.0%   en=1000                
     es  98.3%   es=983  pt=4  gl=3  en=3  it=2  eu=2  id=1  fi=1  da=1
     et  99.6%   et=996  ro=1  id=1  fi=1  en=1        
     fi  100.0%   fi=1000                
     fr  99.2%   fr=992  en=4  sq=2  de=1  ca=1        
     hu  99.9%   hu=999  it=1              
     it  99.5%   it=995  ro=1  mt=1  id=1  fr=1  eu=1      
     nl  99.5%   nl=995  af=3  sv=1  et=1          
     pl  99.6%   pl=996  tr=1  sw=1  nb=1  en=1        
     pt  98.7%   pt=987  gl=4  es=3  mt=1  it=1  is=1  ht=1  fi=1  en=1
     ro  99.8%   ro=998  da=1  ca=1            
     sk  98.8%   sk=988  cs=9  en=2  de=1          
     sl  95.1%   sl=951  hr=32  sr=8  sk=5  en=2  id=1  cs=1    
     sv  99.0%   sv=990  nb=9  en=1            


Tika results (total 97.12% = 16510 / 17000):

     da  87.6%   da=876  no=112  nl=4  sv=3  it=1  fr=1  et=1  en=1  de=1        
     de  98.5%   de=985  nl=3  it=3  da=3  sv=2  fr=2  sl=1  ca=1          
     el  100.0%   el=1000                        
     en  96.9%   en=969  no=10  it=6  ro=4  sk=3  fr=3  hu=2  et=2  sv=1        
     es  89.8%   es=898  gl=47  pt=22  ca=15  it=6  eo=4  fr=3  fi=2  sk=1  nl=1  et=1    
     et  99.1%   et=991  fi=4  fr=2  sl=1  no=1  ca=1              
     fi  99.4%   fi=994  et=5  hu=1                    
     fr  98.0%   fr=980  sl=6  eo=3  et=2  sk=1  ro=1  no=1  it=1  gl=1  fi=1  es=1  de=1  ca=1
     hu  99.9%   hu=999  ca=1                      
     it  99.4%   it=994  eo=4  pt=1  fr=1                  
     nl  97.8%   nl=978  no=8  de=3  da=3  sl=2  ro=2  pl=1  it=1  gl=1  et=1      
     pl  99.1%   pl=991  sl=3  sk=2  ro=1  it=1  hu=1  fi=1            
     pt  94.4%   pt=944  gl=48  hu=2  ca=2  it=1  et=1  es=1  en=1          
     ro  99.3%   ro=993  is=2  sl=1  pl=1  it=1  hu=1  fr=1            
     sk  96.2%   sk=962  sl=21  pl=13  it=2  ro=1  et=1              
     sl  98.5%   sl=985  sk=7  et=4  it=2  pt=1  no=1              
     sv  97.1%   sv=971  no=15  nl=6  da=6  de=1  ca=1              


Language-detection results (total 99.22% = 16868 / 17000):

     da  97.1%   da=971  no=28  en=1      
     de  99.8%   de=998  da=1  af=1      
     el  100.0%   el=1000          
     en  99.7%   en=997  nl=1  fr=1  af=1    
     es  99.5%   es=995  pt=4  en=1      
     et  99.6%   et=996  fi=2  de=1  af=1    
     fi  99.8%   fi=998  et=2        
     fr  99.8%   fr=998  sv=1  it=1      
     hu  99.9%   hu=999  id=1        
     it  99.8%   it=998  es=2        
     nl  97.7%   nl=977  af=21  sv=1  de=1    
     pl  99.9%   pl=999  nl=1        
     pt  99.4%   pt=994  es=3  it=1  hu=1  en=1  
     ro  99.9%   ro=999  fr=1        
     sk  98.7%   sk=987  cs=8  sl=2  ro=1  lt=1  et=1
     sl  97.2%   sl=972  hr=27  en=1      
     sv  99.0%   sv=990  no=8  da=2      
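
The totals reported for the five detectors can be put side by side with a few lines of code. This sketch only re-uses the correct-answer counts given in the headings above. The dictionary and variable names are chosen for this example.

```python
# Correct classifications out of 17000, as reported in each results heading.
N_TEXTS = 17000
totals = {
    "WhatLanguage.net API": 16965,
    "langid.py": 16858,
    "CLD": 16800,
    "Tika": 16510,
    "language-detection": 16868,
}

# Sort detectors from most to fewest correct classifications.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for name, n_correct in ranking:
    errors = N_TEXTS - n_correct
    print(f"{name:22s} {n_correct / N_TEXTS:.2%}  ({errors} errors)")
```

Listing the error counts alongside the percentages makes the gaps easier to see: 35 misclassified texts for the best result against 490 for the lowest.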

Conclusion

Our Language Identification API has the best accuracy (99.79%), closely followed by the language-detection (99.22%) and langid.py (99.16%) libraries. If you need accurate language identification in your application, you can test our API without any risk and see for yourself. Besides being very accurate, our API web service also detects 110 languages and can be called easily from any programming language, since its output is formatted as JSON or XML.
