Article Sourcing in 2023
Article Sourcing is the process of collecting relevant articles from the Web about a topic, stakeholder, process, or institution. Sourced articles are later vetted and used in internal procedures like data extraction, data collection, entity extraction, compiling analysis, press-clipping, and many more.
Many companies still do this manually, either with in-house staff or by outsourcing to freelancers.
Some have automated parts of the process, such as collecting articles from specific sites, but other segments remain manual.
How do you collect articles?
Articles are collected using RSS feeds or by scraping sitemaps of relevant sites.
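To make the two collection routes concrete, here is a minimal sketch that pulls article URLs out of an RSS 2.0 feed and out of a standard sitemap. It uses only the Python standard library; the function names and the sample documents are illustrative assumptions, and a real feed may need extra handling (Atom feeds, sitemap index files, encodings).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def links_from_rss(xml_text: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext("link").strip()
        for item in root.iter("item")
        if item.findtext("link")  # skip items without a usable link
    ]


def links_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> listed under <url> in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample documents, standing in for downloaded files.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>https://example.com/a</link></item>
  <item><title>Second</title><link>https://example.com/b</link></item>
</channel></rss>"""

SITEMAP_SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(links_from_rss(RSS_SAMPLE))
    print(links_from_sitemap(SITEMAP_SAMPLE))
```

Parsing is the easy part. Downloading these files reliably is where the complexity described next comes in.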
This collection process sounds effortless, but when you dive into it, you’ll notice many layers of complexity.
Almost 80% of sites now have some form of anti-bot protection, be it Cloudflare or paid WordPress plugins like Wordfence. They will block your IP address when they notice repetitive scraping of their sites.
When that happens, you need a new address. After some time, the protection will block that new address too, and you’ll be back where you started.
In this video, check out how Niched AI is scraping RSS feeds and how you can connect it with thousands of other applications.
Rotating IPs, aka Proxies
To stay unblocked, you need a way to switch between IP addresses constantly, which makes your tech stack more complex.
Not only do you need a pool of IPs, you need a collection of IPs with a good reputation. And based on your volume, it can get expensive fast.
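As a rough illustration of rotation, the sketch below cycles through a pool of proxies and skips any that a site has blocked. The ProxyPool class, the opener_for helper, and the proxy addresses are hypothetical; a production pool would also track reputation per target site and un-ban addresses after a cooldown.

```python
import itertools
import urllib.request


class ProxyPool:
    """Round-robin pool that skips proxies marked as banned."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._banned = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self) -> str:
        # Look at each proxy at most once per call.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("all proxies are banned")

    def mark_banned(self, proxy: str) -> None:
        self._banned.add(proxy)


def opener_for(proxy: str) -> urllib.request.OpenerDirector:
    """Build a urllib opener that sends HTTP and HTTPS traffic via proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    # Placeholder addresses, not real proxies.
    pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
    proxy = pool.next_proxy()
    opener = opener_for(proxy)  # opener.open(url, timeout=10) would use it
    print("using", proxy)
```

When a request comes back blocked, the spider calls mark_banned on that proxy and asks the pool for the next one.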
Web Spiders
After implementing proxies, you now need to develop web spiders. Those scripts will visit specified sources regularly and pull the list of new articles.
You also need a way to orchestrate them and to make their requests look like visits from real people, not bots.
When it comes to orchestration, you need a scheduling system with a queue and an approach to retry failed attempts.
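A scheduling system of this kind can be sketched with a priority queue ordered by the next run time. The Job and CrawlScheduler names, the retry delay, and the attempt limit below are assumptions chosen for illustration; real deployments usually rely on a dedicated task queue instead of an in-memory heap.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Job:
    run_at: float                                  # only field used for ordering
    source: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class CrawlScheduler:
    """In-memory queue: runs due jobs and re-queues the ones that fail."""

    def __init__(self, max_attempts: int = 3, retry_delay: float = 60.0):
        self.max_attempts = max_attempts
        self.retry_delay = retry_delay
        self._queue: list[Job] = []
        self.dead: list[str] = []                  # sources that exhausted retries

    def schedule(self, source: str, run_at: float) -> None:
        heapq.heappush(self._queue, Job(run_at, source))

    def run_due(self, now: float, crawl: Callable[[str], None]) -> list[str]:
        """Run every job with run_at <= now; return sources that succeeded."""
        done = []
        while self._queue and self._queue[0].run_at <= now:
            job = heapq.heappop(self._queue)
            try:
                crawl(job.source)
                done.append(job.source)
            except Exception:
                job.attempts += 1
                if job.attempts >= self.max_attempts:
                    self.dead.append(job.source)
                else:
                    # Try again later instead of hammering the site.
                    job.run_at = now + self.retry_delay
                    heapq.heappush(self._queue, job)
        return done
```

A worker loop would call run_due with the current time every few seconds, passing in the spider function, and alert on anything that lands in dead.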
Content extraction
Spiders collect only the URLs of new articles, so you still need a way to extract meaningful content from those URLs.
Meaningful content means the article’s actual content, without strange HTML markup, adverts, and 3rd party embeds.
From experience, this is the trickiest part, since every site has its own structure and uses different HTML markup.
For every site, you need a unique extraction script, and after the extraction process, you’ll usually need to run the content through modifier functions.
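To show what a bare-bones extraction step might look like, the sketch below keeps the text of paragraphs and drops scripts, embeds, and page furniture, then normalizes whitespace as a simple modifier step. It is a generic fallback built on the standard library's HTMLParser, not a per-site script; the tag list and the names ArticleTextExtractor and extract_text are assumptions.

```python
from html.parser import HTMLParser

# Elements whose content is treated as noise (an assumed, non-exhaustive list).
SKIP_TAGS = {"script", "style", "iframe", "noscript", "aside", "nav", "footer", "form"}


class ArticleTextExtractor(HTMLParser):
    """Collect the text of <p> elements that sit outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_paragraph = False
        self._current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_paragraph = True
            self._current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Modifier step: collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._current.append(data)


def extract_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


# Hypothetical page used to exercise the extractor.
SAMPLE = """<html><body>
<nav><p>Home | About</p></nav>
<article>
  <h1>Title</h1>
  <p>First <b>paragraph</b> of the story.</p>
  <script>trackPageView();</script>
  <aside><p>Advertisement</p></aside>
  <p>Second paragraph.</p>
  <iframe src="https://ads.example.com"></iframe>
</article>
<footer><p>Copyright</p></footer>
</body></html>"""

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

A generic pass like this breaks down quickly on sites that put article text in div elements or mix ads into paragraphs, which is why per-site scripts end up being necessary.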
Connecting the puzzle pieces
Each of the stages described above is a problem in its own right; combine them and you get a large stack to build and maintain.
Monitoring that everything works as expected is crucial. Sites tend to change their markup, and every such change can break your extraction scripts.
The response time of 3rd party sites will vary, so timeout handling and retry logic are critical.
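One common way to handle slow or flaky sites is a per-request timeout combined with exponential backoff between retries. The sketch below assumes that approach; fetch and fetch_with_retry are illustrative names, and the default timeout, retry count, and delays are placeholders to tune per source.

```python
import time
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download url, giving up if the site does not answer within timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, fetcher=fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetcher(url); on a network error wait and try again.

    The wait doubles after each failure: base_delay, 2x, 4x, and so on.
    OSError covers timeouts, refused connections, and urllib's URLError.
    Note that HTTP error responses such as 404 are URLError subclasses,
    so this simple version retries them as well.
    """
    for attempt in range(retries + 1):
        try:
            return fetcher(url)
        except OSError:
            if attempt == retries:
                raise  # out of retries: let the caller or monitoring see it
            sleep(base_delay * 2 ** attempt)
```

The fetcher and sleep parameters exist so the retry logic can be exercised without touching the network, which also makes it easy to plug in a proxy-aware downloader later.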
How can Niched AI help you with the Article Sourcing process?
Using the intuitive Niched AI user interface, you can easily add any number of custom sources. There are no limits on which sources you can add or where we can pull articles from.
Article Sourcing with Niched AI is simple; adding a new source to the project takes only a moment. After that, our system will handle everything for you.
Niched AI will control proxies, spiders, extraction logic, and monitoring.
You can focus on your business logic and leave the technical side of the process to Niched AI.
On Enterprise plans, you can define specific search queries or topics, and Niched AI will monitor the entire Internet for you. You don’t need to add sources, just topics; the system will handle everything else.
Connecting everything with your workflow
Niched AI can connect with any number of 3rd party tools using our no-code integrations for Make/Integromat and Zapier.
Once Niched AI detects a new article that matches your project settings, the system will pass the content into your no-code workflow.