OpenNLP - Sentence Detection

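While processing a natural language, deciding where sentences begin and end is one of the problems to be addressed. This process is known as Sentence Boundary Disambiguation (SBD), or simply sentence breaking.

The techniques used to detect the sentences in a given text depend on the language of the text.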
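
As a rough illustration of this language dependence, the JDK's own java.text.BreakIterator takes a Locale when it is asked for a sentence iterator. The sketch below is not part of OpenNLP; the class name LocaleSentenceSplit and its split() helper are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleSentenceSplit {

    // Splits text at the sentence boundaries the JDK reports for the given locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            // Each boundary pair delimits one sentence, including trailing spaces.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Hi. How are you? Welcome to finddevguides.";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Passing a different Locale selects the boundary rules the JDK ships for that language, which is the same idea as choosing a language-specific model in OpenNLP.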

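Sentence Detection Using Java

We can detect the sentences in a given text in Java using regular expressions and a set of simple rules.

For example, let us assume that a period, question mark, or exclamation mark ends a sentence in the given text. We can then split the text into sentences using the split() method of the String class, passing the regular expression to it as a String.

Following is a program that determines the sentences in a given text using a Java regular expression (the split() method). Save this program in a file named SentenceDetection_RE.java.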

public class SentenceDetection_RE {
   public static void main(String args[]){

      String sentence = " Hi. How are you? Welcome to finddevguides. "
         + "We provide free tutorials on various technologies";

      String simple = "[.?!]";
      String[] splitString = (sentence.split(simple));
      // trim() drops the space left over after each terminator
      for (String string : splitString)
         System.out.println(string.trim());
   }
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetection_RE.java
java SentenceDetection_RE

On executing, the above program prints the following output.

Hi
How are you
Welcome to finddevguides
We provide free tutorials on various technologies
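
The split above discards the terminators and treats every period as a sentence end. The following sketch, which assumes that sentences are separated by whitespace, keeps the punctuation by using a lookbehind and also shows where a purely rule-based approach breaks down. The class name RegexSentenceSplit is illustrative only.

```java
import java.util.Arrays;

class RegexSentenceSplit {

    // Splits at whitespace that follows '.', '?' or '!'. The lookbehind is
    // zero-width, so the terminator stays attached to its sentence.
    static String[] split(String text) {
        return text.trim().split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        // Terminators are preserved, unlike a plain split on "[.?!]".
        System.out.println(Arrays.toString(split("Hi. How are you? Welcome.")));

        // The abbreviation "Dr." is wrongly treated as a sentence end.
        System.out.println(Arrays.toString(split("Dr. Smith arrived. He sat down.")));
    }
}
```

The second call returns three pieces instead of two, because a fixed rule cannot tell an abbreviation from a sentence end. This kind of ambiguity is what a trained model is meant to resolve.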

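Sentence Detection Using OpenNLP

To detect sentences, OpenNLP uses a predefined model, a file named en-sent.bin. This model is trained to detect the sentences in a given raw text.

The opennlp.tools.sentdetect package contains the classes and interfaces that are used to perform the sentence detection task.

To detect sentences using the OpenNLP library, you need to:

Load the en-sent.bin model using the SentenceModel class.
Instantiate the SentenceDetectorME class.
Detect the sentences using the sentDetect() method of this class.

Following are the steps to be followed to write a program that detects the sentences in a given raw text.

Step 1: Loading the Model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object of the model (instantiate FileInputStream and pass the path of the model as a String to its constructor).
Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block.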

//Loading sentence detector model
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
SentenceModel model = new SentenceModel(inputStream);
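
The snippet above never closes the stream. One possible way to handle this, sketched here with JDK classes only, is to open the file in a try-with-resources block. The ModelStreams class and its open() helper are hypothetical, and the SentenceModel call appears only as a comment because it needs OpenNLP on the classpath.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ModelStreams {

    // Opens the model file, failing early with a clear message if it is missing.
    static InputStream open(Path modelPath) throws IOException {
        if (!Files.isReadable(modelPath)) {
            throw new FileNotFoundException("Model file not readable: " + modelPath);
        }
        return Files.newInputStream(modelPath);
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the tutorial; adjust it to where en-sent.bin is stored.
        Path modelPath = Paths.get("C:/OpenNLP_models/en-sent.bin");

        // try-with-resources closes the stream even if loading throws.
        try (InputStream inputStream = open(modelPath)) {
            // With OpenNLP on the classpath, the model would be built here:
            // SentenceModel model = new SentenceModel(inputStream);
            System.out.println("Opened " + modelPath);
        } catch (FileNotFoundException e) {
            System.err.println(e.getMessage());
        }
    }
}
```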

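Step 2: Instantiating the SentenceDetectorME Class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split raw text into sentences. This class uses a maximum entropy model to evaluate the end-of-sentence characters in a string and determine whether they signify the end of a sentence.

Instantiate this class and pass the model object created in the previous step, as shown below.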

//Instantiating the SentenceDetectorME class
SentenceDetectorME detector = new SentenceDetectorME(model);

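Step 3: Detecting the Sentences

The sentDetect() method of the SentenceDetectorME class is used to detect the sentences in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the text to it as a String.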

//Detecting the sentence
String sentences[] = detector.sentDetect(sentence);

Example

Following is the program that detects the sentences in a given raw text. Save this program in a file named SentenceDetectionME.java.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionME {

   public static void main(String args[]) throws Exception {

      String sentence = "Hi. How are you? Welcome to finddevguides. "
         + "We provide free tutorials on various technologies";

     //Loading sentence detector model
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream);

     //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);

     //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence);

     //Printing the sentences
      for(String sent : sentences)
         System.out.println(sent);
   }
}

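Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionME.java
java SentenceDetectionME

On executing, the above program reads the given String, detects the sentences in it, and displays the following output.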

Hi. How are you?
Welcome to finddevguides.
We provide free tutorials on various technologies

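Detecting the Positions of the Sentences

We can also detect the positions of the sentences using the sentPosDetect() method of the SentenceDetectorME class.

Following are the steps to be followed to write a program that detects the positions of the sentences in a given raw text.

Step 1: Loading the Model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object of the model (instantiate FileInputStream and pass the path of the model as a String to its constructor).
Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block.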

//Loading sentence detector model
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
SentenceModel model = new SentenceModel(inputStream);

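Step 2: Instantiating the SentenceDetectorME Class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split raw text into sentences. This class uses a maximum entropy model to evaluate the end-of-sentence characters in a string and determine whether they signify the end of a sentence.

Instantiate this class and pass the model object created in the previous step.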

//Instantiating the SentenceDetectorME class
SentenceDetectorME detector = new SentenceDetectorME(model);

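Step 3: Detecting the Positions of the Sentences

The sentPosDetect() method of the SentenceDetectorME class is used to detect the positions of the sentences in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the text to it as a String parameter.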

//Detecting the position of the sentences in the paragraph
Span[] spans = detector.sentPosDetect(sentence);

Step 4: Printing the Spans of the Sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of the type Span. The class named Span of the opennlp.tools.util package is used to store the start and end offsets of a span of text.

You can store the spans returned by the sentPosDetect() method in a Span array and print them, as shown in the following code block.

//Printing the spans of the sentences
for (Span span : spans)
   System.out.println(span);

Example

Following is the program that detects the positions of the sentences in a given raw text. Save this program in a file named SentencePosDetection.java.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentencePosDetection {

   public static void main(String args[]) throws Exception {

      String paragraph = "Hi. How are you? Welcome to finddevguides. "
         + "We provide free tutorials on various technologies";

     //Loading sentence detector model
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream);

     //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);

     //Detecting the position of the sentences in the raw text
      Span spans[] = detector.sentPosDetect(paragraph);

     //Printing the spans of the sentences in the paragraph
      for (Span span : spans)
         System.out.println(span);
   }
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencePosDetection.java
java SentencePosDetection

On executing, the above program reads the given String, detects the sentences in it, and displays their spans as follows.

[0..16)
[17..42)
[43..92)
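
The notation [0..16) describes a half-open range: the start offset is included and the end offset is not, which is the same convention String.substring() uses. The sketch below illustrates this with a hypothetical stand-in class named Offsets, so it runs without OpenNLP; it is not the library's Span class.

```java
class HalfOpenSpanDemo {

    // Hypothetical stand-in for a span: start is inclusive, end is exclusive.
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Returns the part of the text that this range covers.
        String coveredText(String text) {
            return text.substring(start, end);
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? Welcome to finddevguides. "
            + "We provide free tutorials on various technologies";

        // Offsets of the first sentence, as in the first line of the output above.
        Offsets first = new Offsets(0, 16);
        System.out.println(first + " -> " + first.coveredText(paragraph));

        // Because the end is exclusive, the length is simply end - start.
        System.out.println("length = " + (first.end - first.start));
    }
}
```

With this convention the character at the end offset, here the space after the question mark, is not part of the sentence.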

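Sentences Along with Their Positions

The substring() method of the String class accepts the begin and end offsets and returns the corresponding string. We can use this method to print the sentences and their spans (positions) together, as shown in the following code block.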

for (Span span : spans)
   System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);

Following is the program that detects the sentences in a given raw text and displays them along with their positions. Save this program in a file named SentencesAndPosDetection.java.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentencesAndPosDetection {

   public static void main(String args[]) throws Exception {

      String sen = "Hi. How are you? Welcome to finddevguides."
         + " We provide free tutorials on various technologies";
     //Loading a sentence model
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream);

     //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);

     //Detecting the position of the sentences in the paragraph
      Span[] spans = detector.sentPosDetect(sen);

     //Printing the sentences and their spans of a paragraph
      for (Span span : spans)
         System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);
   }
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencesAndPosDetection.java
java SentencesAndPosDetection

On executing, the above program reads the given String, detects the sentences along with their positions, and displays the following output.

Hi. How are you? [0..16)
Welcome to finddevguides. [17..42)
We provide free tutorials on various technologies [43..92)

Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent call to the sentDetect() method.

//Getting the probabilities of the last decoded sequence
double[] probs = detector.getSentenceProbabilities();

Following is the program that prints the probabilities associated with a call to the sentDetect() method. Save this program in a file named SentenceDetectionMEProbs.java.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionMEProbs {

   public static void main(String args[]) throws Exception {

      String sentence = "Hi. How are you? Welcome to finddevguides. "
         + "We provide free tutorials on various technologies";

     //Loading sentence detector model
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream);

     //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);

     //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence);

     //Printing the sentences
      for(String sent : sentences)
         System.out.println(sent);

     //Getting the probabilities of the last decoded sequence
      double[] probs = detector.getSentenceProbabilities();

      System.out.println("  ");

      for(int i = 0; i<probs.length; i++)
         System.out.println(probs[i]);
   }
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionMEProbs.java
java SentenceDetectionMEProbs

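On executing, the above program reads the given String, detects the sentences, and prints them. In addition, it prints the probabilities associated with the most recent call to the sentDetect() method, as shown below.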

Hi. How are you?
Welcome to finddevguides.
We provide free tutorials on various technologies

0.9240246995179983
0.9957680129995953
1.0
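
One way these values might be used is to flag sentences whose boundary the model was less sure about. The sketch below assumes that the probabilities line up one-to-one with the detected sentences, as the output above suggests. The LowConfidenceFilter class and the 0.95 threshold are illustrative choices, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

class LowConfidenceFilter {

    // Returns the sentences whose probability is below the threshold.
    static List<String> below(String[] sentences, double[] probs, double threshold) {
        if (sentences.length != probs.length) {
            throw new IllegalArgumentException("Expected one probability per sentence");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(sentences[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Sentences and probabilities copied from the output above.
        String[] sentences = {
            "Hi. How are you?",
            "Welcome to finddevguides.",
            "We provide free tutorials on various technologies"
        };
        double[] probs = {0.9240246995179983, 0.9957680129995953, 1.0};

        // 0.95 is an arbitrary cut-off chosen for this example.
        System.out.println(below(sentences, probs, 0.95));
    }
}
```

Here only the first entry falls below the threshold, which matches the fact that "Hi. How are you?" was returned as a single sentence.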
