Today, more adventures with Grammarly and scraping data from its checking results.

S0-E18/E30 :)

Grammarly results scraping continues

After yesterday's article about the possibility of scraping results from Grammarly's checks, let's now do some more advanced scraping :)

Let's use an example with more issues, so we can experiment with BeautifulSoup and scrape more with it.

I've found that the "Demo document" presented on first use of Grammarly contains a lot of issues that showcase Grammarly's features. So let's use it :)

Scraping:

    def tests_all_in_one(self):
        from selenium import webdriver
        self.driver = webdriver.Firefox()
        filename = "demo_document.txt"
        with open(filename, encoding="utf-8") as f:
            demo_data_text = f.read()
        page_login = GrammarlyLogin(self.driver)
        page_login.make_login('za2217279@mvrht.net', 'test123')
        page_new_doc = GrammarlyNewDocument(self.driver)
        page_new_doc.make_new_document("")
        page_doc = GrammarlyDocument(self.driver)
        page_doc.put_title("DEMO DATA TEXT")
        page_doc.put_text(demo_data_text)
        page_source = GrammarlyDocument(self.driver)
        actual_source = page_source.get_page_source()
        scraper = DocumentScraper(actual_source)
        found_issues = scraper.find_all_issues()
        assert len(found_issues) == 14
        issues_by_type = scraper.return_issues_by_type()
        assert len(issues_by_type) == 2
        assert u'_ed4374-plainTextTitle' in issues_by_type
        assert u'_ed4374-titleReplacement' in issues_by_type
        assert len(issues_by_type['_ed4374-plainTextTitle']) == 3
        assert len(issues_by_type['_ed4374-titleReplacement']) == 11

And the source of Document Scraper:

from bs4 import BeautifulSoup

class DocumentScraper(object):

    def __init__(self, html_source):
        # self.html_source = html_source
        self.bs = BeautifulSoup(html_source, "html.parser")

    def get_issue_div(self):
        # DIV with class=_adbfa1e6-editor-page-cardsCol
        return self.bs.find('div', {'class': '_adbfa1e6-editor-page-cardsCol'})

    def get_all_warnings(self):
        return self.get_issue_div().contents

    def get_all_warnings_texts(self):
        return [element.text for element in self.get_all_warnings()]

    def iterate_over_warnings(self):
        for innerelement in self.get_all_warnings():
            print(innerelement.text)

    def find_all_issues(self):
        return self.bs.find_all('div', {'class': '_ed4374-title'})

    def return_issues_by_type(self):
        issues = self.find_all_issues()
        output = {}
        for issue in issues:
            key = issue.contents[0].attrs['class'][0]
            try:
                output[key].append(issue.contents[0].contents)
            except KeyError:
                output[key] = [issue.contents[0].contents]

        return output

Where demo_document.txt contains "Demo Document" from Grammarly.

This gives you an example of how to make our previous automation useful and transform it into something more.
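As a side note, the try/except KeyError pattern in return_issues_by_type is a common way to group items by key; Python's collections.defaultdict expresses the same grouping a bit more compactly. Here is a minimal sketch of the same logic with made-up issue data (the type names and texts below are illustrative, not real Grammarly class names):

```python
from collections import defaultdict

def group_by_type(issues):
    """Group (issue_type, text) pairs into a dict of lists,
    mirroring what return_issues_by_type does with try/except."""
    output = defaultdict(list)
    for issue_type, text in issues:
        output[issue_type].append(text)
    return dict(output)

# Illustrative data only -- not actual Grammarly output.
issues = [
    ("plainTextTitle", "Possible typo"),
    ("titleReplacement", "Wordy sentence"),
    ("titleReplacement", "Passive voice"),
]
grouped = group_by_type(issues)
print(grouped["titleReplacement"])  # ['Wordy sentence', 'Passive voice']
```

Both versions behave the same; defaultdict just saves the explicit KeyError handling.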

Selenium screenshots for a better understanding of the issues Grammarly found

Taking into account that Selenium has no built-in PDF generator, let's at least take a screenshot of the page so we know exactly where those issues are:

        self.driver.save_screenshot('grammarly_checks.png')
        self.driver.get_screenshot_as_file('grammarly_checks2.png')

The only problem I've found with this is that it does not take a full-page screenshot. I remember that in Java this trick worked and you could capture the full web page (depending on the browser).
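One workaround (just a sketch, not something I've wired into the tests yet) is to scroll through the page in viewport-sized steps and save one screenshot per chunk, which you can stitch together afterwards. The helper names scroll_offsets and capture_full_page below are hypothetical, not part of Selenium:

```python
def scroll_offsets(total_height, viewport_height):
    """Vertical offsets needed to cover a page of total_height
    with viewport-sized captures (last one aligned to the bottom)."""
    if total_height <= viewport_height:
        return [0]
    offsets = list(range(0, total_height - viewport_height + 1, viewport_height))
    if offsets[-1] != total_height - viewport_height:
        offsets.append(total_height - viewport_height)
    return offsets

def capture_full_page(driver, path_prefix):
    # Hypothetical helper: scroll chunk by chunk and save one
    # screenshot per viewport; stitch the pieces later if needed.
    total = driver.execute_script("return document.body.scrollHeight")
    viewport = driver.execute_script("return window.innerHeight")
    for i, y in enumerate(scroll_offsets(total, viewport)):
        driver.execute_script("window.scrollTo(0, arguments[0])", y)
        driver.save_screenshot("%s_%02d.png" % (path_prefix, i))
```

For what it's worth, newer Selenium releases added save_full_page_screenshot on the Firefox driver, which does this natively.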

I might come back to this :)

Acknowledgements

Thanks!

That's it :) Comment, share or don't :)

If you have any suggestions what I should blog about in the next articles - please give me a hint :)

See you tomorrow! Cheers!


