XPath/Python - How to get different HTML tags and text inside a <div>
I'm trying to scrape the HTML content at the URL http://www.dlib.org/dlib/november14/beel/11beel.html using the following Python code:
    import json

    import requests
    from lxml import html

    s = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    content = requests.get(s)
    tree = html.fromstring(content.text)

    # Section titles (text of the h3 elements)
    titoli = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]/h3/text()')
    # Paragraph text (text nodes directly inside the p elements)
    par = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]/p/text()')

    articoli = json.dumps({'titoli': titoli, 'contenuti': par})
    print("content-type: json")
    print()
    print(articoli)
What I really need is an XPath query that returns every tag, the tags' content, and the text inside the useful div of the page. That div can be found at the path /html/body/form/table[3]/tr/td/table[5] or, using the web inspector, under the commented line <!-- content table -->. With the code I posted above I cannot get the entire content of the div, only the titles and the text inside the p elements, and I can't find a way to do it.
To get the actual HTML content of a section of the website using Python/XPath, it is easier to use from lxml import etree instead of from lxml import html. Once you have set up the element tree, there is a function that returns the HTML content of an element, rather than only its text content (as you mentioned). The code is as follows:
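    from lxml import etree
    import requests

    s = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    page = requests.get(s)
    tree = etree.HTML(page.text)  # parse the page; returns the root <html> element

    # xpath() returns a list of matching elements
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
    # Serialize the first match, markup included (returns bytes)
    content = etree.tostring(element[0])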
tree.xpath returns a list of the selected elements. In this case, because a specific XPath is used, it returns a list containing one element. You therefore have to use etree.tostring(element[0]) to access the first element of the list and return the HTML content of that element as a string (a byte string by default under Python 3).