Create a Solr filter that replaces diacritics
Sebastian Mitroi in
Java, General
Some languages (such as Romanian) use diacritics, often called accent marks. It is generally useful to strip diacritics from characters when you build an index with Solr, because a search should find the word “proprietăţi” whether the user types “proprietăţi” or “proprietati”. If you index your text with Solr, you can do this with a custom Solr filter.
First, add the filter to the schema.xml configuration file:
<fieldtype name="text_st" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ... some other filters, for example a lower case filter -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="ro.tremend.solr.diacritics.DiacriticsFilterFactory"/>
  </analyzer>
</fieldtype>
Then create three small classes and a properties file. The first class is the filter factory for Solr, DiacriticsFilterFactory:
package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

/**
 * Create a Solr Filter Factory for diacritics
 *
 * @author Sebastian
 *
 */
public class DiacriticsFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
        return new DiacriticsFilter(input);
    }
}
Next, create the filter class, DiacriticsFilter:
package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.*;
import java.io.IOException;

/**
 * Create the diacritics filter
 *
 * @author Sebastian
 *
 */
public final class DiacriticsFilter extends TokenFilter {
    public DiacriticsFilter(TokenStream in) {
        super(in);
    }

    public final Token next() throws IOException {
        Token t = input.next();
        if (t == null)
            return null;
        t.setTermText(DiacriticsUtils.replaceDiacritics(t.termText()));
        return t;
    }
}
Finally, create the class that does the work, DiacriticsUtils:
package ro.tremend.solr.diacritics;
import java.util.HashMap;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.ResourceBundle;
import java.util.Set;
/**
* Replace Romanian characters
*
* @author Sebastian
*
*/
public class DiacriticsUtils {
private static Map
This class needs a properties file with the diacritics you want to replace:
diacritics.properties
\\u0102=A
\\u0103=a
... define all your language specific characters
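The body of DiacriticsUtils is cut off above, so what follows is only a minimal sketch of how such a helper could look, not the original class. It assumes each mapping is a single character written as a \uXXXX escape, and it reads the mappings from an in-memory string through java.util.Properties (instead of loading diacritics.properties through a ResourceBundle) so that the example is self-contained. The class name DiacriticsSketch and the sample mappings are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical stand-in for DiacriticsUtils; not the original implementation. */
final class DiacriticsSketch {

    // Assumed sample mappings in .properties syntax; a real setup would load a file instead.
    private static final String SAMPLE_MAPPINGS =
            "\\u0102=A\n" + "\\u0103=a\n"
          + "\\u00C2=A\n" + "\\u00E2=a\n"
          + "\\u00CE=I\n" + "\\u00EE=i\n"
          + "\\u0162=T\n" + "\\u0163=t\n"
          + "\\u015E=S\n" + "\\u015F=s\n";

    private static final Map<Character, String> REPLACEMENTS = load(SAMPLE_MAPPINGS);

    private DiacriticsSketch() {
    }

    /** Parses the mappings; Properties decodes the unicode escapes in the keys. */
    private static Map<Character, String> load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Character, String> map = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // Only single-character keys are supported in this sketch.
            if (key.length() == 1) {
                map.put(key.charAt(0), props.getProperty(key));
            }
        }
        return map;
    }

    /** Returns the text with every mapped character replaced; other characters are kept. */
    static String replaceDiacritics(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With these sample mappings, calling DiacriticsSketch.replaceDiacritics on “proprietăţi” gives “proprietati”, and text without mapped characters comes back unchanged.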
Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:
textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
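To illustrate why the query needs the same treatment as the index, here is a small sketch that does not use Solr at all. A hypothetical in-memory term set, TinyIndexSketch, stores normalized terms, and a lookup succeeds only when the query is normalized in the same way. The stripRomanian helper and its handful of mappings are assumptions standing in for DiacriticsUtils.replaceDiacritics.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical in-memory "index" used only to illustrate index-side and query-side normalization. */
final class TinyIndexSketch {
    private final Set<String> terms = new HashSet<>();

    /** Assumed stand-in for DiacriticsUtils.replaceDiacritics; covers only a few lower-case Romanian letters. */
    static String stripRomanian(String text) {
        return text
                .replace('\u0103', 'a').replace('\u00E2', 'a').replace('\u00EE', 'i')
                .replace('\u0163', 't').replace('\u021B', 't')
                .replace('\u015F', 's').replace('\u0219', 's');
    }

    /** Index time: lower-case, strip diacritics, then store each whitespace-separated token. */
    void add(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(stripRomanian(token));
            }
        }
    }

    /** Query time without normalization: a query typed with diacritics misses the stored term. */
    boolean containsRaw(String query) {
        return terms.contains(query.toLowerCase(Locale.ROOT));
    }

    /** Query time with the same normalization that was applied at index time. */
    boolean containsNormalized(String query) {
        return terms.contains(stripRomanian(query.toLowerCase(Locale.ROOT)));
    }
}
```

After adding the text “Proprietăţi imobiliare”, containsRaw fails for the query “proprietăţi” because the stored term has no diacritics, while containsNormalized finds both “proprietăţi” and “proprietati”.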
I hope this will help.
August 29th, 2007 at 9:33 am
Unless you have a very specific need to define a custom list of diacritics to remove, the ISOLatin1AccentFilter that comes with Lucene (and the ISOLatin1AccentFilterFactory that comes with Solr) should solve this problem for you without any custom code.
(ISOLatin1AccentFilter has had a lot of speed improvements in the trunk which should make it into the next version of Solr as well … see LUCENE-871 in the Apache Jira for more info)
October 28th, 2007 at 4:52 am
Hello
Great example
I am new to Solr and trying to understand where I should apply the last step mentioned:
—————————————————————
Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:
textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
————————————————————–
I am building an application where I would need to replace an incoming string with another one specific to our taxonomy.
Please advise.
December 6th, 2007 at 9:04 pm
Even nicer would be to go beyond this accents filter and also use the Romanian stemmer filter from Snowball.
For example you search for:
masini (Romanian word for cars)
and you also get results for:
masina (Romanian word for car)
Example:
http://www.egirl.ro/search?query=masini
Keep up the good work.
January 23rd, 2008 at 5:26 pm
You can also replace diacritics with their base characters using java.text.Normalizer.
This is useful when you don’t know in advance which diacritics can appear: the code decomposes each character and keeps only the base character.
This is the code:
import java.text.Normalizer;

public String toBase(String sText) {
    // NFD splits each accented character into its base character followed by combining marks
    String sDecomposed = Normalizer.normalize(sText, Normalizer.Form.NFD);
    // Remove the combining marks (Unicode category M) and keep the base characters.
    // Characters that have no decomposition are left unchanged.
    return sDecomposed.replaceAll("\\p{M}+", "");
}
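One caveat with the Normalizer approach is that some letters, such as ł, đ and ø, have no canonical decomposition, so NFD leaves them untouched. The sketch below shows one possible way to combine the two ideas from this page: decompose and strip the combining marks first, then apply a small explicit map to whatever is left. The class name AccentFoldingSketch and the contents of the fallback map are illustrative assumptions.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper combining Normalizer with an explicit fallback table. */
final class AccentFoldingSketch {
    // Matches Unicode combining marks (category M), which NFD places after each base letter.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Assumed fallback table for letters that NFD does not decompose; extend as needed.
    private static final Map<Character, String> FALLBACK = new HashMap<>();
    static {
        FALLBACK.put('\u0142', "l"); // small l with stroke
        FALLBACK.put('\u0141', "L"); // capital L with stroke
        FALLBACK.put('\u0111', "d"); // small d with stroke
        FALLBACK.put('\u0110', "D"); // capital D with stroke
        FALLBACK.put('\u00F8', "o"); // small o with stroke
        FALLBACK.put('\u00D8', "O"); // capital O with stroke
    }

    private AccentFoldingSketch() {
    }

    static String fold(String text) {
        // Step 1: decompose, then drop the combining marks.
        String stripped = MARKS.matcher(Normalizer.normalize(text, Normalizer.Form.NFD)).replaceAll("");
        // Step 2: replace the remaining letters that have an entry in the fallback table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = FALLBACK.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }
}
```

Here AccentFoldingSketch.fold turns “proprietăţi” into “proprietati” through decomposition alone, while “Łódź” becomes “Lodz” only because the fallback table handles the Ł.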
January 24th, 2008 at 2:25 pm
Thanks for the suggestion.
February 2nd, 2008 at 5:29 pm
I’m trying to add a factory in Solr for tokenizing Arabic text, but I receive an error (shown at the end of this message). Can you help me solve this problem, please?
Here is my code:
package org.apache.solr.analysis;

import gpl.pierrick.brihaye.aramorph.lucene.ArabicTokenizer;
import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class ArabicTokenizerFactory extends BaseTokenizerFactory {
    public TokenStream create(Reader input) {
        return new ArabicTokenizer(input);
    }
}
Thanks in advance
HTTP Status 500 – Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError> in solrconfig.xml
————————————————————-
java.lang.VerifyError: (class: org/apache/solr/analysis/ArabicTokenizerFactory, method: create signature: (Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;) Wrong return type in function
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.solr.core.Config.findClass(Config.java:308)
at org.apache.solr.core.Config.newInstance(Config.java:319)
at org.apache.solr.schema.IndexSchema.readTokenizerFactory(IndexSchema.java:631)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:605)
at org.apache.solr.schema.IndexSchema.access$000(IndexSchema.java:57)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:330)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:353)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:362)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:73)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:275)
at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:244)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:68)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302)
at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:78)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4222)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1138)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:736)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:448)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:700)
at org.apache.catalina.startup.Catalina.start(Catalina.java:552)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:295)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:433)
April 11th, 2008 at 6:11 pm
Hi!
I have successfully used this approach (ICU library):
http://glaforge.free.fr/weblog/index.php?itemid=115
Przemek