Create a Solr filter that replaces diacritics

Some languages, such as Romanian, use characters with diacritics (often called accent marks). It is generally useful to strip these marks, for example when you build an index with Solr: you want a search to find both “proprietăți” and “proprietati”, so the index should not contain the accented forms. If you use Solr to index your text, you can do this with a custom Solr filter.
First of all, register the filter in the schema.xml configuration file:


<fieldtype name="text_st" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!-- ... some other filters, for example the lower case filter -->
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="ro.tremend.solr.diacritics.DiacriticsFilterFactory"/>
            </analyzer>
</fieldtype>

Then create three small classes and a properties file. The first class is the Solr filter factory, DiacriticsFilterFactory:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

/**
 * Create a Solr Filter Factory for diacritics
 * 
 * @author Sebastian
 * 
 */
public class DiacriticsFilterFactory extends BaseTokenFilterFactory {
	public TokenStream create(TokenStream input) {
		return new DiacriticsFilter(input);
	}
}

Next, create the filter class, DiacriticsFilter, which rewrites the text of every token that passes through it:

package ro.tremend.solr.diacritics;

import org.apache.lucene.analysis.*;
import java.io.IOException;

/**
 * Create the diacritics filter
 * 
 * @author Sebastian
 * 
 */
public final class DiacriticsFilter extends TokenFilter {
	public DiacriticsFilter(TokenStream in) {
		super(in);
	}

	public final Token next() throws IOException {
		Token t = input.next();

		if (t == null)
			return null;

		t.setTermText(DiacriticsUtils.replaceDiacritics(t.termText()));
		return t;
	}
}

Finally, the class that does the actual work, DiacriticsUtils:

package ro.tremend.solr.diacritics;

import java.util.HashMap;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.ResourceBundle;
import java.util.Set;

/**
 * Replace Romanian characters
 * 
 * @author Sebastian
 * 
 */
public class DiacriticsUtils {
	private static Map<String, String> diacritics = new HashMap<String, String>();

	static {
		// Get diacritics from diacritics.properties
		try {
			ResourceBundle resource = ResourceBundle.getBundle("diacritics");
			// Typed as Set<String> so the for-each loop below compiles
			Set<String> keySet = resource.keySet();
			for (String key : keySet) {
				diacritics.put(key, resource.getString(key));
			}
		} catch (MissingResourceException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Replace all diacritics in a string
	 * 
	 * @param s the string
	 * @return the string without diacritics
	 */
	public static String replaceDiacritics(String s) {
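		for (Map.Entry<String, String> entry : diacritics.entrySet()) {
			// replace() matches the key literally; replaceAll() would treat it as a regex
			s = s.replace(entry.getKey(), entry.getValue());
		}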
		return s;
	}

	public static Map<String, String> getDiacritics() {
		return diacritics;
	}
}

This class needs a properties file on the classpath listing the diacritics you want to replace:
diacritics.properties

\u0102=A
\u0103=a
# ... define all your language-specific characters
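
To make the mapping idea concrete without a Solr installation, here is a minimal, self-contained sketch. It is not the code above: the class name DiacriticsSketch is hypothetical, the mapping is hard-coded instead of being read from diacritics.properties, and the string is rewritten in a single pass over its characters rather than with one replace call per key. The table is an assumption about what a Romanian mapping might contain. It lists both the comma-below forms of s and t (U+0218 to U+021B) and the older cedilla forms (U+015E, U+015F, U+0162, U+0163), because real text often mixes the two.

```java
import java.util.HashMap;
import java.util.Map;

class DiacriticsSketch {
    // Hypothetical in-memory stand-in for diacritics.properties
    private static final Map<Character, String> DIACRITICS = new HashMap<>();

    static {
        DIACRITICS.put('\u0102', "A"); // A with breve
        DIACRITICS.put('\u0103', "a"); // a with breve
        DIACRITICS.put('\u00C2', "A"); // A with circumflex
        DIACRITICS.put('\u00E2', "a"); // a with circumflex
        DIACRITICS.put('\u00CE', "I"); // I with circumflex
        DIACRITICS.put('\u00EE', "i"); // i with circumflex
        DIACRITICS.put('\u0218', "S"); // S with comma below
        DIACRITICS.put('\u0219', "s"); // s with comma below
        DIACRITICS.put('\u021A', "T"); // T with comma below
        DIACRITICS.put('\u021B', "t"); // t with comma below
        DIACRITICS.put('\u015E', "S"); // S with cedilla (legacy encoding)
        DIACRITICS.put('\u015F', "s"); // s with cedilla (legacy encoding)
        DIACRITICS.put('\u0162', "T"); // T with cedilla (legacy encoding)
        DIACRITICS.put('\u0163', "t"); // t with cedilla (legacy encoding)
    }

    // Copies the input, swapping every mapped character for its plain form
    static String replaceDiacritics(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String plain = DIACRITICS.get(c);
            if (plain != null) {
                out.append(plain);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "proprietati"
        System.out.println(replaceDiacritics("propriet\u0103\u021Bi"));
    }
}
```

The single pass touches each character once, which may matter when the mapping grows; the properties-driven loop in DiacriticsUtils is easier to extend without recompiling.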


Now the index will not contain diacritics, but you also have to remove them from the query text; otherwise a query typed with diacritics will not match the indexed terms. To do that, just write:

textToFind = DiacriticsUtils.replaceDiacritics(textToFind);
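
The effect of that one line can be illustrated with a toy example. This is only a sketch under stated assumptions: QueryNormalizationSketch and its index method are hypothetical, the "index" is a plain Set of strings rather than a Solr index, and the two-entry mapping stands in for diacritics.properties. It uses Map.of, so it assumes Java 9 or later.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class QueryNormalizationSketch {
    // Hypothetical two-entry mapping standing in for diacritics.properties
    private static final Map<String, String> DIACRITICS =
            Map.of("\u0103", "a", "\u021B", "t");

    static String replaceDiacritics(String s) {
        for (Map.Entry<String, String> entry : DIACRITICS.entrySet()) {
            s = s.replace(entry.getKey(), entry.getValue());
        }
        return s;
    }

    // Toy stand-in for indexing: stores each word with its diacritics removed
    static Set<String> index(String... words) {
        Set<String> terms = new HashSet<>();
        for (String word : words) {
            terms.add(replaceDiacritics(word));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> terms = index("propriet\u0103\u021Bi");
        String textToFind = "propriet\u0103\u021Bi";

        // The raw query still carries diacritics, so it misses: prints false
        System.out.println(terms.contains(textToFind));

        // The normalized query matches the indexed term: prints true
        System.out.println(terms.contains(replaceDiacritics(textToFind)));
    }
}
```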


I hope this will help.