Extracting content (text) from a large website
Thread poster: Penelope Ausejo
Penelope Ausejo
Spain
Local time: 12:55
English to Spanish
Oct 17, 2010

Good afternoon,

A (direct) client of mine is interested in translating their large website, but they don't have the content extracted. Do you know if there is any program that will allow me to extract all the text to a Word document? I own Trados, but I don't think I can use it for this purpose. I wouldn't want to buy a program for this, since it is not a usual requirement for me.

I have tried to copy each page manually, but it will take me days and I will probably miss parts of it. Each page has lots of links to other parts of the website.

Thank you very much in advance!


 
Jaroslaw Michalak
Poland
Local time: 12:55
ProZ.com member since 2004
English to Polish
SITE LOCALIZER
Spider Oct 17, 2010

You might use a web spider, e.g. HTTrack:

http://www.httrack.com/

Please note, though, that you have to set the options carefully, especially those concerning the depth and external links. Also, this works best with static content: if some pages are generated dynamically, you might get different results each time. In that case the underlying database from which the dynamic content is created should be given to you. Another problem might be posed by pages which are static and loaded on the site, but are only called by specific dynamic addressing (e.g. a search argument). A spider will not be able to get them.
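To make the depth and external-link settings concrete, here is a minimal Python sketch of the decision a spider has to make for every link it finds. It is not how HTTrack works internally. The function name `should_follow`, the default `max_depth` of 3 and the example.com addresses are all made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def should_follow(link, page_url, start_url, depth, max_depth=3):
    """Decide whether a spider restricted to one site should follow a link.

    link      -- the href as written in the page (may be relative)
    page_url  -- address of the page the link was found on
    start_url -- address the crawl started from
    depth     -- how many clicks page_url is away from start_url
    """
    # Depth limit: stop once we are max_depth clicks from the start page.
    if depth >= max_depth:
        return False

    # Resolve relative links such as "/about.html" against the current page.
    target = urlparse(urljoin(page_url, link))

    # Skip mailto:, javascript: and other non-web links.
    if target.scheme not in ("http", "https"):
        return False

    # External links: only follow links that stay on the starting host.
    return target.netloc == urlparse(start_url).netloc
```

With a depth limit that is too low, pages deep in the site are silently skipped; with external links allowed, the spider wanders off to other websites. Both are reasons to check the download against the client's site map.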

Ask the client to provide the site tree or at least the complete site map - this way you could at least compare the number of documents you have downloaded with their numbers.
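If the client supplies the site map as a standard sitemap.xml file (an assumption; they may just as well send a spreadsheet or a plain list), the comparison could be sketched in Python as follows. The function names and the idea of passing the downloaded pages as a list of URL paths are hypothetical choices for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by files that follow the sitemaps.org format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_paths(sitemap_xml):
    """Return the set of URL paths listed in a sitemap.xml string."""
    root = ET.fromstring(sitemap_xml)
    paths = set()
    for loc in root.iter(SITEMAP_NS + "loc"):
        # Keep only the path, e.g. "/about.html"; the bare host becomes "/".
        paths.add(urlparse(loc.text.strip()).path or "/")
    return paths


def missing_pages(sitemap_xml, downloaded_paths):
    """List pages that are in the site map but were not downloaded."""
    return sorted(sitemap_paths(sitemap_xml) - set(downloaded_paths))
```

Anything returned by `missing_pages` is a page the spider did not reach, which is a good list to send back to the client.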


 
José Henrique Lamensdorf
Brazil
Local time: 07:55
English to Portuguese
In memoriam
Extracting content and then translating text Oct 17, 2010

Yes, HTTrack is a good option for downloading the website's pages. But then it's still HTML, not text.
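To show what getting from HTML to text involves, here is a minimal Python sketch that uses only the standard library. The names `TextExtractor` and `html_to_text` are made up for this example, and it keeps only visible text: translatable attributes such as image alt text are not picked up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only runs and anything inside <script>/<style>.
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Text extracted this way is fine for reading or quoting, but it has lost all the markup, so it cannot simply be pasted back into the pages after translation.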

It's unlikely that a direct client will, like some translation agencies, demand that you use Trados at all times, even for things like cooking and bathing. Nor does Word seem to be a must, as they want to have the website translated, not only its contents.

So, considering all the possible limitations Jabberwock has mentioned, a possible solution would be CatsCradle, from http://www.stormdance.net. It will allow you to open each HTML file and translate all the text there, without having to see any of the HTML code. Furthermore, it has its own internal CAT tool, TM, and a WYSIWYG screen, so you can see how the translated page is coming out.

CatsCradle won't alter the internal links, and this will require some work from your client's webmaster. After all, you don't know whether, for instance (using my case, Brazil), site.com will become site.com.br, br.site.com, site.com/br or anything else, which will determine the whole structure.


 
FarkasAndras
Local time: 12:55
English to Hungarian
Ask the client to Oct 17, 2010

Jabberwock wrote:

Ask the client to...


That's the key right there. I would finish the sentence with "...put you in contact with the guy who created and maintains the site". That is the only person who will be able to tell you whether it's all just static HTML or there's some complication you'll have to deal with, and that is the only person who can give you the correct source files and tell you what format you'll need to produce.

I could give you a tool that extracts all the text to a Word document, but if you translated that way, the IT guy would never be able to put the text back on the site.


 
sokolniki
United States
Local time: 05:55
English to Russian
PureText Oct 17, 2010

You can also try http://www.stevemiller.net/puretext/ to remove all the website formatting - really simple.

 
Anna Villegas
Mexico
Local time: 04:55
English to Spanish
http://www.webbudget.com/ Oct 17, 2010

I like this one very much, it's easy to manage, and you have 15 days free of charge. Try it!



 
Samuel Murray
Netherlands
Local time: 12:55
ProZ.com member since 2006
English to Afrikaans
Webdown + OmegaT (just for counting) Oct 17, 2010

Penelope Ausejo wrote:
Do you know if there is any program that will allow me to extract all the text to a Word document? I own Trados, but I don't think I can use it for this purpose.


Can't you pump 1000 HTML files into Trados all at once to do an analysis on it? I'm sure you should be able to do it. Anyway, if you can't, you can do a word count with OmegaT (no need to merge the HTML files into a single file).

To get the HTML files, use the programs the other people suggested, or search the forums here for a post or posts in which I mention Webdown.exe or similar. Great little free tool.
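For a rough count without any CAT tool, a short Python script can also walk through a folder of downloaded HTML files. This is only a sketch: the function name `count_words` is made up, it assumes the files are UTF-8, and it counts whitespace-separated words in the visible text, so the figures will not match a Trados or OmegaT analysis exactly.

```python
from html.parser import HTMLParser
from pathlib import Path


class _WordCounter(HTMLParser):
    """Count whitespace-separated words outside <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = 0
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words += len(data.split())


def count_words(folder):
    """Return {relative file name: word count} for all .htm/.html files."""
    counts = {}
    for path in sorted(Path(folder).rglob("*.htm*")):
        parser = _WordCounter()
        # Undecodable bytes are replaced rather than aborting the count.
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        counts[str(path.relative_to(folder))] = parser.words
    return counts
```

Summing the values, e.g. `sum(count_words("mirror").values())` for a hypothetical download folder called `mirror`, gives a total for the whole site, with no need to merge the files first.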


 

