Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources

With the rapid growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the internet. Web crawlers face an indeterminate latency problem caused by differences in response times. The proposed work attempts to optimize the design and implementation of focused web crawlers for Bioinformatics web sources using a Master-Slave architecture. Ideally, a focused crawler should crawl only relevant pages, but the relevance of a genomics page can only be estimated after it has been crawled. The paper proposes a solution for predicting page relevance that is based on Natural Language Processing. The frequency of the keywords in the top-ranked sentences of a page determines the relevance of that page within genomics sources. The proposed solution uses the TextRank algorithm to rank the sentences and to ensure the correct classification of Bioinformatics web pages. Finally, the model is validated by comparing it with a breadth-first search web crawler. The comparison shows a significant reduction in run time for the same harvest rate.
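The relevance measure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the abstract does not give the exact scoring formula, so the sentence splitter, the word-overlap similarity, the `top_k` cut-off, and the function names (`textrank`, `relevance`) are all assumptions made for the example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(a, b):
    """Shared-word count normalised by the log of both sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(set(a) & set(b)) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating a weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def relevance(text, keywords, top_k=3):
    """Fraction of tokens in the top_k ranked sentences that are topic keywords.

    Assumed scoring rule; the paper may normalise differently.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top_tokens = [t for i in ranked[:top_k] for t in tokenize(sentences[i])]
    if not top_tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    return sum(1 for t in top_tokens if t in wanted) / len(top_tokens)
```

A crawler could compare the returned score against a threshold (a tunable, hypothetical parameter) to decide whether a page counts as relevant and whether its outgoing links are worth following.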

eISSN: 1314-4081
Language: English

Publication timeframe: 4 times per year
Journal Subjects: Computer Sciences, Information Technology

Published Online: Jun 18, 2019

Page range: 146 - 158

Received: Feb 22, 2018

Accepted: Feb 14, 2019

DOI: https://doi.org/10.2478/cait-2019-0021

Keywords
Focused crawler, Data extraction, Natural Language Processing, Topical Crawler, TextRank, Distributed Crawler, Master-Slave architecture, Bioinformatics

© 2019 S. R. Mani Sekhar et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
