Frank Andrade and Python Automation : All That Can Go Wrong

Video: https://www.youtube.com/embed/PXMJ6FS7llk. Why watch it? So I know what's doable and can then go to freelancers on Upwork to build me stuff.

If you haven't already, install lxml before you get thrashed by pd.read_html(). If you're on Python 3, use pip3 (or python3 -m pip) so the package lands in the Python 3 install.

>>> pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\html.py:872, in _parser_dispatch(flavor)
    870 else:
    871     if not _HAS_LXML:
--> 872         raise ImportError("lxml not found, please install it")
    873 return _valid_parsers[flavor]

ImportError: lxml not found, please install it

Didn't work! Restarted the notebook. Didn't help!

Tried

!pip3 install lxml

in the notebook and it shows a fresh install. And? Still doesn't work.

Note that read_html *does* work in a console instance of python (IDLE). So why does it fail in the jupyter notebook?

Time for Upwork 😊 And?

By the power of Shehroz: don't just go by what WSL's python shows you. The Jupyter notebook may be running a completely different Python version. Mine was 3.9 (something to do with the install that Visual Studio Code had done), while WSL had 3.8. What should you do? In Jupyter, open a terminal (unless you like typing commands with a ! prefix in the notebook) and run:

pip3 uninstall pandas
pip3 uninstall lxml
pip3 install pandas
pip3 install --upgrade lxml

And now you should be good to go. Bottom line - forget what you see in the terminal from which you launch Jupyter. Use the terminal within Jupyter.
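To confirm the mismatch from inside the notebook, you can ask the kernel which interpreter it is actually running. This is a minimal sketch using only the standard library; the helper name describe_environment is mine, not something from the video.

```python
import importlib.util
import sys


def describe_environment(packages=("pandas", "lxml")):
    """Report the interpreter behind this kernel and whether it can see each package."""
    report = {
        "executable": sys.executable,        # path of the Python running this code
        "version": sys.version.split()[0],   # version string of that interpreter
    }
    for name in packages:
        # find_spec returns None when the top-level package is not importable here
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(describe_environment())
```

Run the same thing from the python in your launching terminal. If the executable paths differ, that is the whole story. From a notebook cell, !{sys.executable} -m pip install lxml installs into the kernel's own interpreter rather than whichever pip3 happens to be first on the PATH.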

Read all tables from a web-page into a list :

tennis = pd.read_html("https://en.wikipedia.org/wiki/List_of_Grand_Slam%E2%80%93related_tennis_records")
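Since read_html hands back a list of DataFrames, one per table it finds, the next chore is working out which index holds the table you want. A rough sketch of a helper for that (summarize_tables is a hypothetical name of mine; it only relies on each item having a .shape, as DataFrames do):

```python
def summarize_tables(tables):
    """Return (index, rows, columns) for each table in the list."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


# Assumed usage in the notebook, with pandas imported as pd:
#   tennis = pd.read_html(url)
#   for index, rows, cols in summarize_tables(tennis):
#       print(index, rows, cols)
```

read_html also takes a match= argument that keeps only the tables containing the given text, which can save some scrolling on a page with dozens of tables.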

Ensure you're not shooting yourself in the foot by running (unintentionally) Jupyter on a different host than you think you are by using the same port as an already-running instance. This can happen if you start the first instance from a Git bash terminal (mingw64) and the second one from WSL.
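One cheap guard is to check whether anything is already listening on the port before launching a second instance. A minimal sketch with the standard library (port_in_use is my own helper name; 8888 is Jupyter's default port):

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


print(port_in_use(8888))  # 8888 is Jupyter's default port
```

If this says True before you start the second instance, give that one its own port, for example jupyter notebook --port 8890, so the browser tab cannot quietly land on the wrong host.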

IMO Frank (using PyCharm BTW) could work on his quality. Literally every step can take hours based on what can go wrong..

Within 15 minutes, he is into reading tables from PDFs. This guy is obviously good, but with no thought given to what can go wrong, what value is he really contributing other than a glorified table of contents?

The thing is, this video is not that old - barely over a month. He gets this to work :

tables[0].export('/tmp/poker.csv')

And I get :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [22], in ()
----> 1 tables[0].export('/tmp/poker.csv')

AttributeError: 'Table' object has no attribute 'export'

Could it be a Mac vs PC thing? If I do :

tables.export('/tmp/poker.csv', f='csv', compress=True)  # creates poker.zip

I get a zip file that has a CSV when unzipped.
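As far as I can tell from Camelot's documented API, this is not a Mac vs PC thing: export() belongs to the TableList that read_pdf returns, while a single Table has to_csv() (and a .df DataFrame) instead. A small sketch of writing each table to its own CSV on that basis; save_each_table and the file naming are my own invention:

```python
from pathlib import Path


def save_each_table(tables, out_dir, stem="table"):
    """Write every table to <out_dir>/<stem>-<index>.csv and return the paths."""
    written = []
    for i, table in enumerate(tables):
        target = str(Path(out_dir) / f"{stem}-{i}.csv")
        table.to_csv(target)  # per-table method; export() lives on the list
        written.append(target)
    return written


# Assumed usage: tables = camelot.read_pdf(pdf_path)
#   save_each_table(tables, "/tmp", stem="poker")
```

For just the first table, tables[0].to_csv('/tmp/poker.csv') should do what the failing line was after.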

What I take issue with is just using only the pre-canned example from the Camelot documentation page. That one works fine.

Why doesn't this one work? Make a table in Excel. Paste that into PowerPoint and now Save As PDF. Open up that PDF and you can select text. That, per the Camelot documentation, is good enough. Try that with Camelot and you get zilch. Why? Go figure..

I can select text but Camelot fails to extract table from PDF

For a change, Google actually proves useful and I find that you need to specify the "flavor". By default, read_pdf uses flavor='lattice', which assumes the table cells are demarcated with lines. Not true? Then specify flavor='stream'.

In this particular case, if you do this, you win :

tbl2 = camelot.read_pdf('/home/ananth/win/junk/camelot_test.pdf', pages='1', flavor='stream')
tbl2.export('/home/ananth/win/Downloads/camelot_test.csv', f='csv')

Which makes you wonder about the guys who built this package. What are users supposed to know about the PDF? What if you wanted to scour all tables and process them and they're not all a uniform flavor? What then?
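One workaround for mixed PDFs is to try each flavor in turn and keep the first that finds anything. This is only a sketch under a loud assumption: that "zero tables found" is a good enough failure signal, which it may not be if lattice returns junk instead of nothing. The helper name and the reader parameter are mine; reader exists only so the function can be exercised without Camelot installed.

```python
def read_tables_any_flavor(path, pages="1", flavors=("lattice", "stream"), reader=None):
    """Try each flavor in order; return (flavor, tables) for the first that finds tables."""
    if reader is None:
        import camelot  # imported lazily so the helper can be tried without it
        reader = camelot.read_pdf
    for flavor in flavors:
        tables = reader(path, pages=pages, flavor=flavor)
        if len(tables) > 0:  # assumption: an empty result means "wrong flavor"
            return flavor, tables
    return None, []
```

Calling it once per page (pages='1', then pages='2', and so on) lets each page settle on its own flavor instead of forcing one choice on the whole document.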

Moving on. Next, is a quick intro to XPath to extract tags. That playground is a fun thing - whoever set that up - thanks 😊

I didn't know you needed an independent install of Chrome in WSL2 to be able to use Selenium there. Probably enough reason (conserving disk space) for MS to come up with more integration. Anyhow, here's all I had to do. BLUF - no audio, and fixing *that* needs a true geek 😊

sudo apt-get update
sudo apt-get install -y curl unzip xvfb libxi6 libgconf-2-4
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb
google-chrome --version
wget https://chromedriver.storage.googleapis.com/103.0.5060.134/chromedriver_linux64.zip
rm google-chrome-stable_current_amd64.deb
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
chromedriver --version
which chromedriver
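One thing the commands above quietly depend on: the chromedriver URL hard-codes 103.0.5060.134, and the driver's major version has to match the installed Chrome. A rough sketch for comparing the two --version outputs; the helper names are mine and the sample strings are only shaped like what the two commands print:

```python
import re


def major_version(version_output):
    """Extract the major version number from '<name> <major>.<minor>...' output."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when browser and driver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Sample strings, assumed to resemble the real --version output
print(versions_match("Google Chrome 103.0.5060.134", "ChromeDriver 103.0.5060.134"))
```

If Chrome has moved on by the time you run the install, swap the version in the chromedriver URL for one that matches google-chrome --version.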

This bit is cool - if you get to seeing "Chrome is being controlled by automated test software", Congratulations 😊

All it takes is (this is NOT headless mode, so it will throw up a Chrome window) :

from selenium import webdriver
from selenium.webdriver.chrome.service import Service # wasn't needed in SE3, but needed in SE4

website = "https://finance.yahoo.com/"
cdrv = "/usr/bin/chromedriver"

service = Service(executable_path=cdrv)
driver = webdriver.Chrome(service=service)
driver.get(website)

It would be nice to know how to gracefully shut down Chrome and the driver, wouldn't it? 😊
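For the shutdown, Selenium's driver.quit() closes the browser windows and stops the chromedriver process. Wrapping it in a context manager means it runs even when the scraping code blows up halfway. A minimal sketch; managed_driver is my own helper, and it takes a factory function so it works with whatever driver setup you already have:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the with-block, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # closes Chrome and ends the chromedriver session


# Assumed usage with the names from the snippet above:
#   with managed_driver(lambda: webdriver.Chrome(service=service)) as driver:
#       driver.get(website)
```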

OK, more surprises..

Selenium doesn't like :

elements = driver.find_elements(by="xpath", value='//h2[contains(@class,"Fz")]/text()')
------------------------------------------
InvalidSelectorException: Message: invalid selector: The result of the xpath expression "//h2[contains(@class,"Fz")]/text()" is: [object Text]. It should be an element.
  (Session info: chrome=103.0.5060.134)

Nice
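The message does say what it wants: a Selenium locator has to resolve to elements, not text nodes. So drop the /text() step and read .text from each element afterwards. A small sketch of that; element_texts is a hypothetical helper of mine, and by="xpath" is the same locator string used in the failing call:

```python
def element_texts(driver, xpath):
    """Return the visible text of every element matched by xpath."""
    # Locators must resolve to elements, so strip a trailing /text() step
    if xpath.endswith("/text()"):
        xpath = xpath[: -len("/text()")]
    return [el.text for el in driver.find_elements(by="xpath", value=xpath)]


# Assumed usage with the driver from earlier:
#   headlines = element_texts(driver, '//h2[contains(@class,"Fz")]/text()')
```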

Python Kaizen - The Infinite Loop of Improvement
