Tolerim
a month ago
My task is to write Node.js code that extracts data from MCA records.
A Node.js developer is required to complete the following task: extract the names of companies newly registered each day with the Ministry of Corporate Affairs (MCA). The script should scrape the MCA records daily and capture, for each newly registered company, the company name, director name, phone number, email address, and physical address. Each address should be geocoded, and the records sorted by distance from Raipur. The results should be stored in a CSV file or database along with the date and time of registration. The script should run automatically every day via a cron job and email the data to a specified address after each scrape. Any errors encountered during execution should be recorded in a separate file. Additional features such as data deduplication and filtering of irrelevant records may also be implemented. Suggested libraries include Requests, BeautifulSoup, Geopy, Pandas, and SMTP.
Answers(2)
Vogi
a month ago
Verified Answer
It sounds like you need a Node.js script that scrapes the MCA records for newly registered companies every day, extracts the details you listed (company name, director name, phone number, email address, and physical address), geocodes each address so the records can be sorted by distance from Raipur, stores the results in a CSV file or database along with the registration date and time, runs daily via a cron job, emails the output after each scrape, and logs any errors to a separate file. Deduplication and filtering of irrelevant records can be added on top.
You can find Node.js developers on platforms such as Upwork, Freelancer, or Fiverr. You can also post your job on LinkedIn or other job boards to find qualified candidates.
Regarding the libraries you mentioned:
- Requests is a Python library for making HTTP requests; in Node.js, axios, node-fetch, or the (now deprecated) request package fill the same role.
- BeautifulSoup is a Python library for web scraping; Cheerio offers similar HTML parsing for Node.js.
- Geopy is a Python library for geocoding and reverse geocoding; node-geocoder is a comparable option for Node.js.
- Pandas is a Python library for data manipulation and analysis; Data-Forge is a comparable option for Node.js.
- SMTP is the mail-transfer protocol rather than a module; in Node.js, Nodemailer is the usual library for sending email over SMTP.
Do you have any other questions?
Tolerim
a month ago
I am sorry, but I cannot provide complete working code for this task without further information and context, such as the structure and format of the MCA records, the specific geocoding library to be used, the desired format and layout of the output, and the authentication and authorization needed to access the MCA data and send emails.
However, here are some general guidelines and code snippets for the various steps involved in the task:
Scraping MCA Records:
For scraping the MCA records, one option is to use the request and Cheerio libraries in Node.js (request is deprecated, so axios is a maintained alternative). Here is a code snippet that fetches the HTML response from the MCA company-search page:
const request = require('request');
const cheerio = require('cheerio');

const URL = 'https://www.mca.gov.in/mcafoportal/viewCompanyMasterData.do';

// POST the search form, then load the returned HTML into Cheerio
request.post({ url: URL, form: { action: 'Search', companyname: 'example' } }, (error, response, body) => {
  if (!error && response.statusCode === 200) {
    const $ = cheerio.load(body);
    // parse and extract data from HTML using Cheerio selectors
  }
});
Extracting Required Details:
After getting the HTML response, we need to parse and extract the required details such as company name, director name, phone number, email, and address. Here is an example code snippet using Cheerio selectors (the selectors below are illustrative and must be adapted to the actual markup of the results page):
const companyName = $('table tr:nth-child(2) td:nth-child(2)').text().trim();
const directorName = $('table tr:nth-child(3) td:nth-child(2)').text().trim();
const phone = $('table tr:nth-child(4) td:nth-child(2)').text().trim();
const email = $('table tr:nth-child(5) td:nth-child(2)').text().trim();
const address = $('table tr:nth-child(6) td:nth-child(2)').text().trim();
Geocoding and Sorting:
Next, we need to calculate the distance between each company's address and Raipur. Geopy is Python-only; in Node.js, a geocoding library such as node-geocoder combined with geolib for the distance calculation does the same job. Here is a code snippet (it must run inside an async function):
const NodeGeocoder = require('node-geocoder');
const geolib = require('geolib');
const geocoder = NodeGeocoder({ provider: 'openstreetmap' }); // free Nominatim provider
const [raipur] = await geocoder.geocode('Raipur, Chhattisgarh, India');
const [company] = await geocoder.geocode(address);
// geolib reads the latitude/longitude fields of each result; the distance is in metres
const distance = geolib.getDistance(raipur, company);
Finally, we can sort the extracted data by the distance from Raipur using the Array.sort method:
data.sort((a, b) => a.distance - b.distance);
Storing Data and Sending Emails:
We can store the extracted data in a database or CSV file using a library such as sqlite3 or csv-writer. Here is an example that stores the data in a CSV file with csv-writer:
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const writer = createCsvWriter({
  path: 'data.csv',
  header: [
    { id: 'companyName', title: 'Company Name' },
    { id: 'directorName', title: 'Director Name' },
    { id: 'phone', title: 'Phone' },
    { id: 'email', title: 'Email' },
    { id: 'address', title: 'Address' },
    { id: 'distance', title: 'Distance' },
    { id: 'date', title: 'Date of Registration' }
  ]
});

// writeRecords returns a promise, so await it (or chain .then) before exiting
await writer.writeRecords(data);
We can also send an email with the extracted data using the Nodemailer library. Here is a code snippet:
const nodemailer = require('nodemailer');

const transporter = nodemailer.createTransport({
  service: 'gmail',
  auth: {
    user: 'your.email@gmail.com',
    pass: 'app-password' // Gmail requires an app password, not the account password
  }
});

const mailOptions = {
  from: 'your.email@gmail.com',
  to: 'recipient@example.com',
  subject: 'Newly Registered Companies',
  attachments: [{
    filename: 'data.csv',
    path: 'data.csv'
  }]
};

transporter.sendMail(mailOptions, (error, info) => {
  if (error) {
    console.log(error);
  } else {
    console.log('Email sent: ' + info.response);
  }
});
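Scheduling the Daily Run:
To run the scrape automatically every day, a system cron entry that invokes node scrape.js is enough, or the schedule can live in-process using the node-cron package. A minimal sketch, assuming a hypothetical runDailyScrape() wrapper around the steps above:
const cron = require('node-cron');

// '0 8 * * *' fires every day at 08:00; adjust the expression as needed
cron.schedule('0 8 * * *', async () => {
  try {
    await runDailyScrape(); // hypothetical wrapper around the scrape/geocode/email steps
  } catch (error) {
    console.error('Daily scrape failed:', error);
  }
});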
Handling Errors and Edge Cases:
It is important to handle errors and edge cases such as network errors, missing or invalid data, duplicate entries, and irrelevant data. We can use try-catch blocks, error-handling functions, and data-validation functions for these cases. Here is an example that handles request failures; because request is callback-based, we promisify it so it can be awaited inside try-catch:
const util = require('util');
const post = util.promisify(request.post);

try {
  const response = await post({ url: URL, form: { action: 'Search', companyname: 'example' } });
  if (response.statusCode === 200) {
    const $ = cheerio.load(response.body);
    // parse and extract data from HTML using Cheerio selectors
  } else {
    throw new Error('Invalid status code: ' + response.statusCode);
  }
} catch (error) {
  console.error('Request failed:', error);
}
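The task also calls for recording errors in a separate file. A minimal sketch using Node's built-in fs module (the errors.log file name is an arbitrary choice):
const fs = require('fs');

// Append a timestamped entry to errors.log, creating the file if it does not exist
function logError(error) {
  fs.appendFileSync('errors.log', `${new Date().toISOString()} ${error.stack || error}\n`);
}
Calling logError(error) inside the catch blocks above writes failures to the log file instead of only the console.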
Additional Features:
Additional features such as deduplication and filtering can be implemented with plain JavaScript array methods or a utility library such as Lodash (OpenRefine is a standalone data-cleaning tool, not a Node.js library). Here is an example that keeps only records whose company name contains 'Private Limited'; a deduplication sketch follows below:
const filteredData = data.filter((item) => {
  return item.companyName.includes('Private Limited');
});
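For deduplication, a minimal sketch that treats the lower-cased company name as the key (the company's CIN would be a more reliable key if the scrape captures it):
// Keep only the first record seen for each company name
const seen = new Set();
const dedupedData = filteredData.filter((item) => {
  const key = item.companyName.toLowerCase();
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});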