After watching this mejorando.la episode, I got curious about Scrapy, a framework for scraping and crawling: it walks through the pages of a website and extracts the information they contain.
A little while ago I needed this framework, because I was looking for presentations about Oracle WebLogic 11g and the site that had a good one did not let me download it. So a light bulb went on and I started searching the HTML code for information that would help me get the SLIDES.
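To illustrate what "searching the HTML" means here, this is a minimal sketch that uses only the Python 3 standard library instead of Scrapy. It assumes the page marks each slide with an img tag of class slide_image whose data-full attribute holds the full-size image URL, which are the names the spider further down relies on. The sample markup and the names SlideImageParser and extract_slide_urls are invented for this example.

```python
from html.parser import HTMLParser


class SlideImageParser(HTMLParser):
    """Collect the data-full URL of every <img> tagged with class slide_image."""

    def __init__(self):
        super().__init__()
        self.full_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names.
        # (The spider's XPath, by contrast, requires an exact match.)
        classes = (attributes.get("class") or "").split()
        if "slide_image" in classes and attributes.get("data-full"):
            self.full_urls.append(attributes["data-full"])


def extract_slide_urls(html):
    """Return the full-size slide URLs found in an HTML string."""
    parser = SlideImageParser()
    parser.feed(html)
    parser.close()
    return parser.full_urls


# Hypothetical sample markup, invented for this example.
SAMPLE_HTML = """
<html><body>
<img class="slide_image" data-normal="http://example.com/s/slide-1-638.jpg"
     data-full="http://example.com/s/slide-1-1024.jpg">
<img class="logo" src="http://example.com/logo.png">
<img class="slide_image" data-normal="http://example.com/s/slide-2-638.jpg"
     data-full="http://example.com/s/slide-2-1024.jpg">
</body></html>
"""

if __name__ == "__main__":
    for url in extract_slide_urls(SAMPLE_HTML):
        print(url)
```

Scrapy does the same job with a single XPath expression, which is why the spider itself stays short.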
Requirements:
-Scrapy
-Python
-A Linux distribution of your choice (I used Linux Mint)
To run the code, use:
scrapy crawl slideshareWeb (Enter)
from the directory of the application you built; the Tutorial I attached as an external link explains this more clearly.
import urllib2
import tarfile
import os
from os.path import basename
from urlparse import urlsplit

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from slideshareWeb.items import SlidesharewebItem


class slideshareWebSpider(BaseSpider):
    name = "slideshareWeb"
    allowed_domains = ["slideshare.net"]
    start_urls = [
        "http://es.slideshare.net/JustinKestelyn/oracle-weblogic-server-12c-developer-overview"
    ]

    def messages(self, msg_text):
        # Print a message framed by two separator lines
        print "--------------------------------------------------------------------------------------"
        print msg_text
        print "--------------------------------------------------------------------------------------"

    def delete_file(self, file_name):
        os.remove(file_name)

    def make_tar(self, tar_filename, files_compress):
        # Pack the downloaded images into one archive.
        # Note: "w:gz" writes a gzip-compressed tar, whatever the file extension.
        self.messages("Building Tar File..." + tar_filename)
        tar = tarfile.open(tar_filename, "w:gz")
        for name in files_compress:
            tar.add(name)
            # Delete image file
            self.delete_file(name)
        tar.close()
        return

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        title = hxs.select('//title/text()').extract()
        self.messages("Extracting slides from : " + title[0])
        # The handheld link is only used to name the tar file
        link_slides = hxs.select('//link[contains(@media, "handheld")]/@href').extract()
        fileTar = basename(urlsplit(link_slides[0])[2]) + '.tar'
        # Every slide is an <img class="slide_image"> carrying its URLs in data-* attributes
        stats = hxs.select('//img[@class="slide_image"]')
        count = 0
        items = []
        files_comp = []
        for stat in stats:
            l_normal = stat.select('@data-normal').extract()
            l_full = stat.select('@data-full').extract()
            item = SlidesharewebItem()
            item['number'] = count
            item['link_normal'] = l_normal[0]
            item['link_full'] = l_full[0]
            items.append(item)
            count += 1
            try:
                # Download the full-size image into the current directory
                imgData = urllib2.urlopen(l_full[0]).read()
                fileName = basename(urlsplit(l_full[0])[2])
                print "Downloading : ", fileName
                output = open(fileName, 'wb')
                output.write(imgData)
                output.close()
                files_comp.append(fileName)
            except Exception:
                # Skip slides that fail to download
                pass
        if count > 0:
            try:
                # Remove the tar file left over from a previous run, if any
                self.delete_file(fileTar)
            except OSError:
                pass
            self.make_tar(fileTar, files_comp)
        else:
            self.messages("NO Slides in " + title[0] + " maybe SWFObject :-( ")
        return items
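The spider above is Python 2 code written against an old Scrapy API. As a rough Python 3 sketch of its two helper steps (deriving a local file name from an image URL, then packing the files into a gzip-compressed tar and deleting the originals), something like the following should work with the standard library alone. The function names file_name_from_url and pack_and_remove, the example.com URLs and the placeholder file contents are assumptions made for this example; nothing is downloaded.

```python
import os
import tarfile
import tempfile
from os.path import basename
from urllib.parse import urlsplit


def file_name_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return basename(urlsplit(url).path)


def pack_and_remove(tar_path, file_paths):
    """Add each file to a gzip-compressed tar, then delete the original."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in file_paths:
            # Store only the file name so the archive has no directory prefix.
            tar.add(path, arcname=basename(path))
            # tar.add has already copied the data, so the file can go.
            os.remove(path)
    return tar_path


if __name__ == "__main__":
    # Hypothetical URLs, invented for this example.
    urls = [
        "http://example.com/slides/deck-1-1024.jpg?cb=1",
        "http://example.com/slides/deck-2-1024.jpg?cb=1",
    ]
    work_dir = tempfile.mkdtemp()
    paths = []
    for url in urls:
        path = os.path.join(work_dir, file_name_from_url(url))
        with open(path, "wb") as output:
            output.write(b"placeholder bytes instead of image data")
        paths.append(path)

    archive = pack_and_remove(os.path.join(work_dir, "deck.tar.gz"), paths)
    with tarfile.open(archive) as tar:
        print(tar.getnames())
```

Naming the archive .tar.gz here, rather than .tar as in the spider, simply makes the extension match the "w:gz" mode.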