Search tokenization and language support

Posts Table Pro uses an indexed search system to allow searching all content in the table and to make table searches faster. When content is indexed, the plugin normalizes the text, extracts special patterns such as numbers and dates, and then splits the remaining content on whitespace and punctuation.

This approach works well for languages that separate words with spaces, but it does not include dedicated word segmentation, stemming, language-specific tokenization, or n-gram tokenization by default.

This article is intended for developers. Customizing the search tokenizer is advanced usage of Posts Table Pro and is not covered by our standard plugin support.

Languages that may need custom tokenization

Some languages may need custom parsing to get the best search results:

Chinese, Japanese, and Korean may need word segmentation or n-grams.
Thai, Lao, Khmer, and Myanmar text often has limited whitespace between words.
Languages with complex word forms may benefit from language-specific stemming or normalization.

For languages without reliable spaces between words, a complete phrase or clause may be stored as one long token until the parser reaches punctuation. On a large site, this can create many long, near-unique rows in the ptp_dv_tokens database table.
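
The effect described above can be reproduced with a plain whitespace-and-punctuation split. This is a minimal sketch for illustration only: `ptp_example_naive_split()` is a hypothetical helper, not the plugin's tokenizer, and it skips the normalization and special-pattern steps that Posts Table Pro performs.

```php
<?php
// Illustration only: a naive split, NOT the Posts Table Pro tokenizer.
function ptp_example_naive_split( string $text ): array {
	// Split on runs of Unicode whitespace (\s, \p{Z}) and punctuation (\p{P}).
	return preg_split( '/[\s\p{Z}\p{P}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
}

// English: spaces separate the words, so the tokens are short.
$english = ptp_example_naive_split( 'Fast table search, with filters.' );
// [ 'Fast', 'table', 'search', 'with', 'filters' ]

// Chinese: only the punctuation splits the text, so each clause is one token.
$chinese = ptp_example_naive_split( '我们的插件支持快速搜索，并且可以筛选。' );
// [ '我们的插件支持快速搜索', '并且可以筛选' ]
```

A search for 搜索 would not match either Chinese token exactly, which is why segmentation or n-grams may be needed for this kind of content.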

Developer filters

Advanced developers can use the following filters to adjust how Posts Table Pro parses indexed search content:

posts_table_search_parsed_tokens - replace or adjust the final token count array.
posts_table_search_min_token_length - change the minimum indexed token length.
posts_table_search_max_token_length - change the maximum indexed token length.
posts_table_search_strict_matching - disable accent-folded token rows.
posts_table_search_special_token_patterns - change special token extraction patterns, such as numbers, dates, and hyphenated terms.

Tokenizer customizations must be active both when indexing content and when processing search terms. After changing tokenizer filters, rebuild the search index before testing results.

You can find the rebuild link in the Posts Table Pro information section of your WordPress Site Health page.
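
As a rough sketch of how one of the simpler filters might be used, the snippet below lowers the minimum indexed token length to 1, which may matter when single characters are meaningful words. It assumes the filter passes the current minimum as an integer and expects an integer back; that signature is not documented here, so check the plugin source first. The function name `ptp_example_min_token_length()` is hypothetical.

```php
<?php
/**
 * Assumption: the filter receives the current minimum length as an integer
 * and expects an integer in return. Verify this against the plugin source.
 */
function ptp_example_min_token_length( $length ) {
	// Only ever lower the existing minimum, never raise it.
	return min( (int) $length, 1 );
}

// Guarded so the file can also be loaded outside WordPress.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'posts_table_search_min_token_length', 'ptp_example_min_token_length' );
}
```

Because this changes which tokens are stored, the search index needs to be rebuilt afterwards, and the number of indexed rows may grow.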

Experimental n-gram example

Posts Table Pro does not currently provide supported n-gram tokenization. The example below is provided as guidance only, for developers who want to experiment on their own site.

This example replaces tokens containing Han characters with overlapping two-character tokens (bigrams). For example, 快速搜索 becomes 快速, 速搜, and 搜索. Bigrams can improve substring matching for Chinese content, but they can also increase ptp_dv_search_index rows and change search relevance. Test this on a staging site before using it on a large production site.

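add_filter(
	'posts_table_search_parsed_tokens',
	function ( $tokens, $content ) {
		// \p{Han} is the short script form; the longer \p{Script=Han}
		// spelling is only accepted by newer PCRE2 releases.
		if ( ! preg_match( '/\p{Han}/u', $content ) ) {
			// No Han characters: leave the default tokens untouched.
			return $tokens;
		}

		$filtered = [];

		// Keep every existing token that contains no Han characters.
		foreach ( $tokens as $token => $count ) {
			// PHP stores numeric-looking keys such as "2024" as integers,
			// so cast to string instead of testing is_string().
			$token = (string) $token;

			if ( ! preg_match( '/\p{Han}/u', $token ) ) {
				$filtered[ $token ] = $count;
			}
		}

		// Collect each unbroken run of Han characters from the content.
		preg_match_all( '/\p{Han}+/u', $content, $matches );

		foreach ( $matches[0] as $segment ) {
			// Split the run into individual Unicode characters.
			$chars = preg_split( '//u', $segment, -1, PREG_SPLIT_NO_EMPTY );

			// Emit overlapping bigrams. A run of one character yields none.
			for ( $i = 0, $len = count( $chars ); $i < $len - 1; $i++ ) {
				$token = $chars[ $i ] . $chars[ $i + 1 ];
				$filtered[ $token ] = ( $filtered[ $token ] ?? 0 ) + 1;
			}
		}

		return $filtered;
	},
	10,
	2
);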
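
For Thai, where bigrams are usually a poorer fit than real word boundaries, the same filter could in principle delegate to a dictionary-based segmenter. The sketch below is equally experimental and unsupported. It assumes PHP's intl extension is installed and that `$tokens` is a token => count array as in the example above. `ptp_example_segment_thai()` is a hypothetical name, and the quality of the word breaks depends on the ICU version on the server.

```php
<?php
/**
 * Experimental sketch: re-segment Thai text with ICU's word break iterator.
 * Assumes $tokens maps each token to its count, as in the bigram example.
 */
function ptp_example_segment_thai( $tokens, $content ) {
	$content = (string) $content;

	// Do nothing without the intl extension or without any Thai text.
	if ( ! class_exists( 'IntlBreakIterator' ) || ! preg_match( '/\p{Thai}/u', $content ) ) {
		return $tokens;
	}

	$filtered = [];

	// Keep every existing token that contains no Thai characters.
	foreach ( $tokens as $token => $count ) {
		$token = (string) $token; // Numeric-looking keys arrive as integers.

		if ( ! preg_match( '/\p{Thai}/u', $token ) ) {
			$filtered[ $token ] = $count;
		}
	}

	$iterator = IntlBreakIterator::createWordInstance( 'th' );
	preg_match_all( '/\p{Thai}+/u', $content, $matches );

	foreach ( $matches[0] as $segment ) {
		$iterator->setText( $segment );

		// Each part is the text between two consecutive word boundaries.
		foreach ( $iterator->getPartsIterator() as $word ) {
			// Skip parts that contain no letters or digits.
			if ( preg_match( '/[\p{L}\p{N}]/u', $word ) ) {
				$filtered[ $word ] = ( $filtered[ $word ] ?? 0 ) + 1;
			}
		}
	}

	return $filtered;
}

// Guarded so the file can also be loaded outside WordPress.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'posts_table_search_parsed_tokens', 'ptp_example_segment_thai', 10, 2 );
}
```

As with bigrams, the callback has to be active both when content is indexed and when search terms are processed, otherwise the stored tokens and the query tokens will not line up.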

After adding custom tokenization

After adding or changing tokenizer code:

Rebuild the Posts Table Pro search index.
Confirm that ptp_dv_tokens contains the expected custom tokens instead of long language-specific clauses.
Test representative search terms and confirm they return the expected posts.
Compare token counts, search index size, and search performance before and after the change.
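
To make the before-and-after comparison more concrete, a small helper can summarise a token array. This is a minimal sketch: `ptp_example_token_stats()` is a hypothetical function, not part of Posts Table Pro, and it assumes the token => count array shape used in the n-gram example.

```php
<?php
/**
 * Summarise a token => count array so two tokenizer setups can be compared.
 * Hypothetical helper; not part of Posts Table Pro.
 */
function ptp_example_token_stats( array $tokens ): array {
	$longest = 0;

	foreach ( array_keys( $tokens ) as $token ) {
		// Count Unicode characters rather than bytes.
		$chars   = preg_split( '//u', (string) $token, -1, PREG_SPLIT_NO_EMPTY ) ?: [];
		$longest = max( $longest, count( $chars ) );
	}

	return [
		'unique_tokens' => count( $tokens ),
		'occurrences'   => array_sum( $tokens ),
		'longest_token' => $longest,
	];
}

// One long clause token, as produced by a whitespace and punctuation split.
$before = ptp_example_token_stats( [ '快速搜索' => 1 ] );
// [ 'unique_tokens' => 1, 'occurrences' => 1, 'longest_token' => 4 ]

// The same text after bigram tokenization.
$after = ptp_example_token_stats( [ '快速' => 1, '速搜' => 1, '搜索' => 1 ] );
// [ 'unique_tokens' => 3, 'occurrences' => 3, 'longest_token' => 2 ]
```

A drop in the longest token length together with a rise in the number of tokens is the expected pattern. Whether the larger index is acceptable depends on the size of the site.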
