public interface IFieldExtractor
Provides methods for extracting fields from a document.
The example demonstrates how to implement the interface.
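import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LogExtractor implements IFieldExtractor
{
    // Number of leading characters dropped from each log line.
    private static final int PREFIX_LENGTH = 12;
    private final String[] extensions = new String[] { ".log" };

    public final String[] getExtensions() { return extensions; }

    public final DocumentField[] getFields(String filePath)
    {
        File file = new File(filePath);
        DocumentField[] fields = new DocumentField[]
        {
            new DocumentField("FileName", file.getAbsolutePath()),
            new DocumentField("Content", extractContent(filePath)),
        };
        return fields;
    }

    // Reads the file as UTF-8 and returns its lines with the fixed-width prefix removed.
    private String extractContent(String filePath)
    {
        StringBuilder result = new StringBuilder();
        try {
            List<String> lines = Files.readAllLines(Paths.get(filePath), StandardCharsets.UTF_8);
            for (String line : lines)
            {
                // Lines no longer than the prefix are skipped; substring(12) would throw on shorter ones.
                if (line.length() > PREFIX_LENGTH)
                {
                    // Keep line boundaries so words on adjacent lines are not fused together.
                    result.append(line.substring(PREFIX_LENGTH)).append('\n');
                }
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return result.toString();
    }
}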
The example demonstrates how to use the custom extractor for indexing.
String indexFolder = "c:\\MyIndex\\"; // Specify path to the index folder
String documentsFolder = "c:\\MyDocuments\\"; // Specify path to a folder containing documents to search
Index index = new Index(indexFolder); // Creating or loading an index
index.getIndexSettings().getCustomExtractors().addItem(new LogExtractor()); // Adding custom text extractor to index settings
index.add(documentsFolder); // Indexing documents from the specified folder
Modifier and Type | Method and Description |
---|---|
String[] | getExtensions() - Gets the supported extensions. |
DocumentField[] | getFields(InputStream stream) - Extracts all fields from the specified document. |
DocumentField[] | getFields(String filePath) - Extracts all fields from the specified document. |
String[] getExtensions()
Gets the supported extensions.
DocumentField[] getFields(String filePath)
Extracts all fields from the specified document.
filePath - The document file path.
DocumentField[] getFields(InputStream stream)
Extracts all fields from the specified document.
stream - The document stream.
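The LogExtractor example above only shows the file path overload. As a minimal sketch of how content might be read for the stream overload, the helper below applies the same 12-character prefix removal to an InputStream. The class name LogStreamContent and the constant PREFIX_LENGTH are hypothetical and not part of the library. The sketch also assumes UTF-8 input and that lines no longer than the prefix carry no content.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it is not part of the library.
final class LogStreamContent {
    // Assumed width of the prefix to drop from each line (the example above uses 12).
    static final int PREFIX_LENGTH = 12;

    // Reads UTF-8 lines from the stream and returns them with the prefix removed.
    // The stream is not closed here because the caller owns it.
    static String extractContent(InputStream stream) {
        StringBuilder result = new StringBuilder();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Skip lines that are not longer than the prefix.
                if (line.length() > PREFIX_LENGTH) {
                    result.append(line.substring(PREFIX_LENGTH)).append('\n');
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return result.toString();
    }
}
```

Inside an implementation of getFields(InputStream stream), the returned string could be wrapped as new DocumentField("Content", LogStreamContent.extractContent(stream)), in the same way the file path overload builds its "Content" field.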