Create a Solr filter that replaces diacritics

August 28th, 2007 by Sebastian Mitroi in Java, General

Some languages, such as Romanian, use diacritics (often called accent marks). When you build a search index with Solr, it is generally useful to strip these marks so that a search finds both “proprietăţi” and “proprietati”. One way to do this in Solr is to write a custom filter.
First, register the filter in the schema.xml configuration file:


<fieldtype name="text_st" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!-- ... some other filters, for example the lower case filter -->
                <filter class="solr.LowerCaseFilterFactory"/>   
                <filter class="ro.tremend.solr.diacritics.DiacriticsFilterFactory"/>

            </analyzer>
</fieldtype>
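
To see what this chain does to a piece of text, here is a rough plain-Java approximation. It is only a sketch and does not use the Solr or Lucene API: the whitespace split stands in for the tokenizer, and stripDiacritics is a hypothetical stand-in for the custom filter that covers just two characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalysisChainSketch {

    // Hypothetical stand-in for the diacritics filter; handles only two Romanian letters.
    static String stripDiacritics(String term) {
        return term.replace('\u0103', 'a')   // a with breve
                   .replace('\u0163', 't');  // t with cedilla
    }

    // Mimics the order of the chain: tokenize, then lower-case, then strip diacritics.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input produces one empty token, which is skipped
            }
            terms.add(stripDiacritics(raw.toLowerCase(Locale.ROOT)));
        }
        return terms;
    }

    public static void main(String[] args) {
        // The input is "Proprietăţi noi", written with escapes to avoid encoding issues
        System.out.println(analyze("Propriet\u0103\u0163i noi")); // prints [proprietati, noi]
    }
}
```

The order matters in this sketch: because lower-casing runs first, the replacement step only has to know about lower-case letters.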

Then create three small classes and a properties file. The first class is the Solr filter factory, DiacriticsFilterFactory:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

/**
 * Create a Solr Filter Factory for diacritics
 *
 * @author Sebastian
 *
 */
public class DiacriticsFilterFactory extends BaseTokenFilterFactory {
	public TokenStream create(TokenStream input) {
		return new DiacriticsFilter(input);
	}
}

Next, create the filter class, DiacriticsFilter:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.*;
import java.io.IOException;

/**
 * Create the diacritics filter
 *
 * @author Sebastian
 *
 */
public final class DiacriticsFilter extends TokenFilter {
	public DiacriticsFilter(TokenStream in) {
		super(in);
	}

	public final Token next() throws IOException {
		Token t = input.next();

		if (t == null)
			return null;

		t.setTermText(DiacriticsUtils.replaceDiacritics(t.termText()));
		return t;
	}
}

Finally, create the class that does the actual work, DiacriticsUtils:

package ro.tremend.solr.diacritics;

import java.util.HashMap;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.ResourceBundle;
import java.util.Set;

/**
 * Replace Romanian diacritics
 *
 * @author Sebastian
 *
 */
public class DiacriticsUtils {
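	// Maps each diacritic (used as a regular expression by replaceAll) to its replacement
	private static Map<String, String> diacritics = new HashMap<String, String>();

	static {
		// Get diacritics from diacritics.properties
		try {
			ResourceBundle resource = ResourceBundle.getBundle("diacritics");
			Set<String> keySet = resource.keySet();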
			for (String key : keySet) {
				diacritics.put(key, resource.getString(key));
			}
		} catch (MissingResourceException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Replace all diacritics in a string
	 *
	 * @param s the string
	 * @return the string without diacritics
	 */
	public static String replaceDiacritics(String s) {
		for (String key : diacritics.keySet()) {
			s = s.replaceAll(key, diacritics.get(key));
		}
		return s;
	}

	public static Map<String, String> getDiacritics() {
		return diacritics;
	}
}

This class needs a properties file with the diacritics you want to replace:
diacritics.properties

\\u0102=A
\\u0103=a
... define all your language specific characters
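
A note on the escaping in this file, offered as a sketch rather than a definitive statement: with a doubled backslash the loaded key is the six-character text of the escape, which still works because replaceAll treats the key as a regular expression and the regex engine understands that escape. With a single backslash the loader decodes the key to the character itself. The example below uses java.util.Properties with an in-memory string as a stand-in for the resource bundle (an assumption made to keep it self-contained), and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class DiacriticsPropertiesSketch {

    // Loads key=value lines and applies each pair with replaceAll.
    static String stripWith(String fileContent, String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : props.stringPropertyNames()) {
            // The key is compiled as a regular expression here
            text = text.replaceAll(key, props.getProperty(key));
        }
        return text;
    }

    public static void main(String[] args) {
        // The word "proprietăţi", written with escapes to avoid encoding issues
        String word = "propriet\u0103\u0163i";
        // File lines with a doubled backslash: the key stays an escape for the regex engine
        String doubled = "\\\\u0103=a\n\\\\u0163=t\n";
        // File lines with a single backslash: the loader decodes the key to the character
        String single = "\\u0103=a\n\\u0163=t\n";
        System.out.println(stripWith(doubled, word)); // prints proprietati
        System.out.println(stripWith(single, word));  // prints proprietati
    }
}
```

Because the keys are regular expressions in this approach, a key containing a regex metacharacter such as a dot or a bracket would need to be escaped as well.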


Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:

textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
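
Here is a small sketch of why the query needs the same treatment as the indexed text. It does not use Solr: a plain set of strings plays the role of the index, and normalize is a hypothetical stand-in for DiacriticsUtils.replaceDiacritics combined with lower-casing, with a hard-coded table of lower-case Romanian letters instead of the properties file.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class QueryNormalizationSketch {

    // Hard-coded table for illustration; the article loads its table from a properties file.
    private static final Map<Character, Character> BASE = new HashMap<>();
    static {
        BASE.put('\u0103', 'a'); // a with breve
        BASE.put('\u00E2', 'a'); // a with circumflex
        BASE.put('\u00EE', 'i'); // i with circumflex
        BASE.put('\u0219', 's'); // s with comma below
        BASE.put('\u015F', 's'); // s with cedilla
        BASE.put('\u021B', 't'); // t with comma below
        BASE.put('\u0163', 't'); // t with cedilla
    }

    // Lower-cases the text, then swaps each listed letter for its base letter.
    static String normalize(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            Character base = BASE.get(c);
            out.append(base != null ? base : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Index" one term, normalized the same way the filter would normalize it
        Set<String> index = new HashSet<>();
        index.add(normalize("Propriet\u0103\u0163i"));

        String textToFind = "propriet\u0103\u0163i";
        System.out.println(index.contains(textToFind));            // prints false
        System.out.println(index.contains(normalize(textToFind))); // prints true
    }
}
```

The raw query misses because the stored term no longer contains the accented letters; once both sides go through the same function, the lookup succeeds.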


I hope this will help.

8 Responses

  1. Hoss Says:

    Unless you have a very specific need to define a custom list of diacritics to remove, the ISOLatin1AccentFilter that comes with Lucene (and the ISOLatin1AccentFilterFactory that comes with Solr) should solve this problem for you without any custom code.

    (ISOLatin1AccentFilter has had a lot of speed improvements in the trunk which should make it into the next version of Solr as well … see LUCENE-871 in apache Jira for more info)

  2. chetan Says:

    Hello

    Great example
    I am new to Solr and am trying to understand where I should apply the last step mentioned:
    ----------------------------------------
    Now the index will not contain diacritics, but you have to remove the diacritics from the query too. To do that just write this:

    textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
    ----------------------------------------

    I am building an application where I need to replace an incoming string with another one specific to our taxonomy.
    Please advise.

  3. Catalin Says:

    Even nicer would be to go beyond this accents filter and also use the Romanian stemmer filter from Snowball.

    For example, you search for:
    masini (Romanian word for cars)
    and you also get results for:
    masina (Romanian word for car)

    Example:
    http://www.egirl.ro/search?query=masini

    Keep up the good work.

  4. Alina Says:

    You can also replace diacritics with the base characters using java.text.Normalizer.
    This is useful when you don’t know which diacritics can appear. This code will always extract the base character.

    This is the code:
    // requires: import java.text.Normalizer;
    public String toBase(String sText) {
        StringBuilder sAux = new StringBuilder();

        for (int i = 0; i < sText.length(); i++) {
            String sLetter = String.valueOf(sText.charAt(i));

            // NFD splits an accented letter into its base letter followed by combining marks
            sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);

            // Keep only the first char, which is the base letter
            sAux.append(sLetter.charAt(0));
        }
        return sAux.toString();
    }

  5. Sebastian Says:

    Thanks for the suggestion.

  6. heba Says:

    I’m trying to add a factory in Solr for tokenizing Arabic text, but I
    receive an error (shown at the end of this message). Can you help me solve this problem, please?

    Here is my code:

    package org.apache.solr.analysis;

    import gpl.pierrick.brihaye.aramorph.lucene.ArabicTokenizer;
    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    public class ArabicTokenizerFactory extends BaseTokenizerFactory{
    public TokenStream create(Reader input) {
    return new ArabicTokenizer(input);
    }
    }

    Thanks in advance

    HTTP Status 500 – Severe errors in solr configuration. Check your log
    files for more detailed information on what may be wrong. If you want
    solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError> in
    solrconfig.xml
    -------------------------------------------------------------
    java.lang.VerifyError: (class: org/apache/solr/analysis/ArabicTokenizerFactory, method: create signature: (Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;) Wrong return type in function
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at org.apache.solr.core.Config.findClass(Config.java:308)
        at org.apache.solr.core.Config.newInstance(Config.java:319)
        at org.apache.solr.schema.IndexSchema.readTokenizerFactory(IndexSchema.java:631)
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:605)
        at org.apache.solr.schema.IndexSchema.access$000(IndexSchema.java:57)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:330)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:353)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:362)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:73)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:275)
        at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:244)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:68)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302)
        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:78)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4222)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
        at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
        at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1138)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
        at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022)
        at org.apache.catalina.core.StandardHost.start(StandardHost.java:736)
        at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014)
        at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
        at org.apache.catalina.core.StandardService.start(StandardService.java:448)
        at org.apache.catalina.core.StandardServer.start(StandardServer.java:700)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:552)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:295)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:433)

  7. PeWu Says:

    Hi!

    I have successfully used this approach (ICU library):
    http://glaforge.free.fr/weblog/index.php?itemid=115

    Przemek

  8. spl chars ~`!@#$%^&*()_+{}|:;"'?/ Says:

    FilterFactoryspl chars ~`!@#$%^&*()_+{}|:;”‘?/

    test message ¢ £
