A look at how we use natural language processing to improve the speed and accuracy of our data acquisition processes.
By Sean Wang (github.com/shanglun), Full Stack Engineer at CB Insights.
Natural language processing — a technology that allows software applications to process human language — has become ubiquitous over the last few years. Google search is increasingly capable of answering natural-sounding questions like “how many calories are in 2,500 tons of butter?” Apple’s Siri is able to understand a wide variety of questions, and here at CB Insights, we use natural language processing to improve the speed and accuracy of our data acquisition processes.
Today, we will look at the technology behind these applications and develop a natural-language processing application of our own.
- Project Description
- Setting Up the Dependencies
- Writing the Analyzer
- Comparing the Portfolio Against the PropNouns Set
- Putting It All Together
- Conclusion
Project Description
We will build a news relevance analyzer that checks if a newspaper article mentions companies in our stock portfolio. This is a simplified version of the application that powers the “Related News” feature on the CB Insights platform. For example, if we are invested in Netflix, Microsoft, and Luxottica, we want the parser to identify if a given article mentions one or more of the companies.
We will be breaking down the project into several phases. We will first scrape the article from the web, extract the article body, and analyze the article’s grammatical structure. Then, we will take the result of the analysis and determine if the article mentions companies in our portfolio.
If you’d like to follow along, you can find a complete copy of the code used here.
Setting Up the Dependencies
We will use the Stanford NLP library to power the analyzer. Stanford NLP is a powerful library for natural language processing and supports many languages. As Stanford NLP is a Java library, we will be using Java as our programming language. I will use IntelliJ IDEA as my editor, so feel free to make adjustments to the workflow as your editor requires.
Stanford NLP can be downloaded as a Maven dependency (if you’re not familiar with Maven, here is a quick guide). Simply add the following to your pom.xml and import the dependencies:
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.6.0</version>
<classifier>models</classifier>
</dependency>
We will also be using BoilerPipe to extract the article body from downloaded web pages, so let’s also add the following to the pom.xml file:
<dependency>
<groupId>de.l3s.boilerpipe</groupId>
<artifactId>boilerpipe</artifactId>
<version>1.1.0</version>
</dependency>
BoilerPipe’s HTML Fetcher uses NekoHTML, so let’s also add this to the pom.xml file:
<dependency>
<groupId>net.sourceforge.nekohtml</groupId>
<artifactId>nekohtml</artifactId>
<version>1.9.22</version>
</dependency>
Import the dependencies and we are good to go.
Writing the Analyzer
Writing the Scraper and Cleaner
Let’s begin by writing the article scraper part of the analyzer. The job of the scraper is not only to download the page content but also to extract the main article body from it. When you download an article from a news source, it comes with a large amount of excess information: videos, outbound links, and advertisements. BoilerPipe uses a simple yet powerful algorithm to remove these excess parts so that our application can focus on the relevant text.
BoilerPipe has built-in utilities for web scraping, downloading HTML from the web, and cleaning the downloaded text. Here, we define a function, extractFromURL, that takes a URL and returns the most relevant text as a string.
import java.net.URL;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;
public class BoilerPipeExtractor {
    public static String extractFromUrl(String userUrl)
            throws java.io.IOException, org.xml.sax.SAXException, de.l3s.boilerpipe.BoilerpipeProcessingException {
        // Fetch the raw HTML, parse it, and extract the main article text
        final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(userUrl));
        final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        return CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
    }
}
Let’s try it out with an example article regarding the merger of optical giants Essilor and Luxottica, which you can find here. Feed this URL into the function and see what comes out.
public class App {
    public static void main(String[] args) throws IOException, SAXException, BoilerpipeProcessingException {
        String urlString = "http://www.reuters.com/article/us-essilor-m-a-luxottica-group-idUSKBN14Z110";
        String text = BoilerPipeExtractor.extractFromUrl(urlString);
        System.out.println(text);
    }
}
In your output, you should see the main body of the article, without the ads, HTML tags, or outbound links.
Writing the Tagger
Now that we have the main body of the article, we can work on determining if the article mentions companies that are of interest to us. You may be tempted to do a string or regular expression search, but there are several challenges to this approach.
First, a string search may be prone to false positives. An article that mentions “Microsoft Excel” may be tagged as mentioning “Microsoft,” for instance. Second, depending on the construction of the regular expression, a regular expression search can lead to false negatives. For example, an article that contains the phrase “Luxottica’s quarterly earnings exceeded expectations” may be missed by a regular expression search that searches for “Luxottica” surrounded by white spaces. Finally, if we are interested in a large number of companies and are processing a large number of articles, searching through the entire body of text for every company in our portfolio may prove to be extremely time-consuming.
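To make these two failure modes concrete, here is a minimal sketch in plain Java. The sample sentences and patterns are illustrative, not taken from the analyzer:

```java
import java.util.regex.Pattern;

public class RegexPitfalls {
    public static void main(String[] args) {
        // False positive: a plain substring search finds "Microsoft"
        // inside "Microsoft Excel", even though the sentence is about the product.
        String article1 = "The report was built in Microsoft Excel.";
        assert article1.contains("Microsoft"); // matches, arguably wrongly

        // False negative: requiring whitespace on both sides of the name
        // misses possessives like "Luxottica's".
        String article2 = "Luxottica's quarterly earnings exceeded expectations.";
        Pattern spacePadded = Pattern.compile("\\sLuxottica\\s");
        assert !spacePadded.matcher(article2).find(); // no match, wrongly

        System.out.println("Both pitfalls demonstrated.");
    }
}
```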
Stanford’s CoreNLP library can solve all three of these problems. The CoreNLP library has many powerful features, and for today’s analyzer we will use the Parts-of-Speech (POS) tagger. We can use the POS tagger to find all the proper nouns in the article and compare them to our portfolio of stocks. By incorporating NLP technology, we not only improve the accuracy of our tagger and minimize false positives and negatives, but we also dramatically reduce the amount of text we need to compare against our portfolio, since proper nouns are a much smaller subset of the original text. By pre-processing our portfolio into a data structure with a low membership query cost, we can dramatically reduce the time needed to analyze an article.
Stanford CoreNLP provides a convenient class called MaxentTagger that can provide POS tagging in just a few lines of code. You can find the documentation here. We implement a class to use the MaxentTagger:
import java.util.HashSet;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PortfolioNewsAnalyzer {
    private HashSet<String> portfolio;

    // Path to the pre-trained POS model inside the stanford-corenlp models jar
    private static final String modelPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";

    private MaxentTagger tagger;

    public PortfolioNewsAnalyzer() {
        tagger = new MaxentTagger(modelPath);
    }

    // Returns the input text with a part-of-speech tag appended to each word
    public String tagPos(String input) {
        return tagger.tagString(input);
    }
}
Our tagger function, tagPos, takes a string as an input and outputs a string that contains the words in the original string along with the part of speech attached. Feed the output of the scraper into the tagger function and see what comes out.
Change your main function to the following and run the program:
public static void main(String[] args) throws IOException, SAXException, BoilerpipeProcessingException {
    PortfolioNewsAnalyzer analyzer = new PortfolioNewsAnalyzer();
    String urlString = "http://www.reuters.com/article/us-essilor-m-a-luxottica-group-idUSKBN14Z110";
    String text = BoilerPipeExtractor.extractFromUrl(urlString);
    String tagged = analyzer.tagPos(text);
    System.out.println(tagged);
}
You should see something like:
…Luxottica_NNP and_CC Essilor_NNP in_IN 46_CD billion_CD euro_NN merger_NN to_TO create_VB eyewear_NN giant_JJ …
Processing the Tagged Output into a Set
In the previous sections, we built functions to download, clean, and tag a newspaper article. We have the parts of speech of every word in the article. Now we want to compare the proper nouns against our investment portfolio.
The simple approach would be to extract all the proper nouns and compare the nouns with our investment portfolio. In some cases, such as Luxottica, this will work. However, when a company has a multi-word name, such as Carl Zeiss, this approach will fall short since it only looks at single words.
Therefore, we will want to build a system that tracks both individual nouns and potential noun phrases.
To find all the proper nouns, we first split the tagged string into tokens using the space as a separator. Then we split each token on the underscore and check whether its tag marks a proper noun (NNP).
Once we have all the proper nouns, we will want to compare them against our investment portfolio, which means performing a potentially large number of membership checks. We will therefore store the proper nouns in a HashSet: in exchange for disallowing duplicate entries and not keeping track of order, a HashSet offers fast membership queries.
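As a quick illustration of that trade-off (the sample company names are ours, not from the article):

```java
import java.util.HashSet;

public class HashSetDemo {
    public static void main(String[] args) {
        HashSet<String> portfolio = new HashSet<>();
        portfolio.add("Luxottica");
        portfolio.add("Netflix");
        portfolio.add("Luxottica"); // duplicate entries are silently ignored

        // Membership checks are average O(1), regardless of set size
        assert portfolio.contains("Luxottica");
        assert !portfolio.contains("Microsoft");
        assert portfolio.size() == 2; // the duplicate was not stored

        System.out.println("HashSet membership checks passed.");
    }
}
```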
Below is the function that implements the splitting and the storing of proper nouns and proper noun phrases. The code adds each proper noun to the HashSet. It also keeps a running list of adjacent proper nouns and adds them to the HashSet. Add the following code as a static function to the PortfolioNewsAnalyzer class:
public static HashSet<String> extractProperNouns(String taggedOutput) {
    HashSet<String> propNounSet = new HashSet<String>();
    String[] split = taggedOutput.split(" ");
    List<String> propNounList = new ArrayList<String>();
    for (String token : split) {
        String[] splitToken = token.split("_");
        if (splitToken.length >= 2 && splitToken[1].equals("NNP")) {
            propNounList.add(splitToken[0]);
        } else if (!propNounList.isEmpty()) {
            // A non-NNP token ends the current run of proper nouns,
            // so store the accumulated phrase and reset the list.
            // StringUtils here is edu.stanford.nlp.util.StringUtils.
            propNounSet.add(StringUtils.join(propNounList, " "));
            propNounList.clear();
        }
    }
    // Don't forget to check for a proper noun phrase at the end!
    if (!propNounList.isEmpty()) {
        propNounSet.add(StringUtils.join(propNounList, " "));
        propNounList.clear();
    }
    return propNounSet;
}
Now, the function should return a set with both the individual proper nouns and the consecutive proper nouns joined by a space.
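For example, running the function on a small hand-made tagged string yields both single-word names and multi-word phrases. The sketch below is self-contained, so it uses Java's built-in String.join in place of Stanford's StringUtils.join; the logic is otherwise the same:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ProperNounDemo {
    // Same logic as extractProperNouns, with String.join standing in
    // for Stanford's StringUtils.join.
    public static HashSet<String> extractProperNouns(String taggedOutput) {
        HashSet<String> propNounSet = new HashSet<>();
        List<String> propNounList = new ArrayList<>();
        for (String token : taggedOutput.split(" ")) {
            String[] splitToken = token.split("_");
            if (splitToken.length >= 2 && splitToken[1].equals("NNP")) {
                propNounList.add(splitToken[0]);
            } else if (!propNounList.isEmpty()) {
                propNounSet.add(String.join(" ", propNounList));
                propNounList.clear();
            }
        }
        if (!propNounList.isEmpty()) {
            propNounSet.add(String.join(" ", propNounList));
        }
        return propNounSet;
    }

    public static void main(String[] args) {
        HashSet<String> result = extractProperNouns(
            "Carl_NNP Zeiss_NNP and_CC Luxottica_NNP make_VBP lenses_NNS");
        assert result.contains("Carl Zeiss"); // adjacent NNPs joined into a phrase
        assert result.contains("Luxottica");  // single proper noun kept as-is
        assert !result.contains("Carl");      // phrase members are not stored alone
        System.out.println(result);
    }
}
```

Note that the words inside a phrase are stored only as the joined phrase, not individually, which is exactly what we want for multi-word company names like Carl Zeiss.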
Comparing the Portfolio Against the PropNouns Set
We are almost done! In the previous sections, we built a scraper that can download and extract the body of an article, a tagger that can parse the article body and identify proper nouns, and a processor that takes the tagged output and collects the proper nouns into a HashSet. Now all that’s left to do is to take the hash set and compare it with the list of companies that we’re interested in.
The implementation is now simple. Add the following code to your PortfolioNewsAnalyzer class:
public PortfolioNewsAnalyzer() {
    // The constructor and the portfolio field already exist;
    // just add the initialization line below
    this.portfolio = new HashSet<String>();
    tagger = new MaxentTagger(modelPath);
}

public void addPortfolioCompany(String company) {
    this.portfolio.add(company);
}

public boolean arePortfolioCompaniesMentioned(HashSet<String> articleProperNouns) {
    // True exactly when the article's proper nouns and the portfolio overlap
    return !Collections.disjoint(articleProperNouns, this.portfolio);
}
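Collections.disjoint returns true only when the two collections share no elements, so negating it means "at least one portfolio company was mentioned." A quick illustration with made-up sets:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;

public class DisjointDemo {
    public static void main(String[] args) {
        HashSet<String> portfolio =
            new HashSet<>(Arrays.asList("Luxottica", "Netflix"));
        HashSet<String> articleNouns =
            new HashSet<>(Arrays.asList("Essilor", "Luxottica", "Reuters"));
        HashSet<String> unrelatedNouns =
            new HashSet<>(Arrays.asList("Microsoft", "Seattle"));

        // Overlap on "Luxottica": not disjoint, so the article is relevant
        assert !Collections.disjoint(articleNouns, portfolio);
        // No shared elements: disjoint, so the article is not relevant
        assert Collections.disjoint(unrelatedNouns, portfolio);

        System.out.println("disjoint checks passed");
    }
}
```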
Putting It All Together
Now we can run the entire application — scraping, cleaning, tagging, collecting, and comparing. Below is the function that calls the entire application, which you can add to the PortfolioNewsAnalyzer class:
public boolean analyzeArticle(String urlString) throws IOException, SAXException, BoilerpipeProcessingException {
    String articleText = BoilerPipeExtractor.extractFromUrl(urlString);
    String tagged = this.tagPos(articleText);
    HashSet<String> properNounsSet = extractProperNouns(tagged);
    return this.arePortfolioCompaniesMentioned(properNounsSet);
}
Finally, we can use the application. Here is an example using the same article as above and Luxottica as the portfolio company:
public static void main(String[] args) throws IOException, SAXException, BoilerpipeProcessingException {
    PortfolioNewsAnalyzer analyzer = new PortfolioNewsAnalyzer();
    analyzer.addPortfolioCompany("Luxottica");
    boolean mentioned = analyzer.analyzeArticle("http://www.reuters.com/article/us-essilor-m-a-luxottica-group-idUSKBN14Z110");
    if (mentioned) {
        System.out.println("Article mentions portfolio companies");
    } else {
        System.out.println("Article does not mention portfolio companies");
    }
}
Run this and the application should print “Article mentions portfolio companies.” Change the portfolio company from Luxottica to an unrelated company such as “Microsoft” and the application should print “Article does not mention portfolio companies.”
Conclusion
In today’s article, we built an application that downloads an article from a URL, cleans it using BoilerPipe, processes it using Stanford NLP, and checks if the article mentions companies in our portfolio. I hope this article introduced useful concepts in natural language processing and that it inspired you to write natural language applications of your own.
Once again, you can find a copy of the code used here.
P.S.: We’re hiring! If you’re an engineer who enjoys solving tough data problems, drop us a line at https://www.cbinsights.com/jobs