Introduction
Converting HTML to Markdown is straightforward, but before doing so it helps to know what Markdown is.
Markdown is a plain text formatting syntax aimed at making writing for the internet easier. The philosophy behind Markdown is that plain text documents should be readable without tags mussing everything up, but there should still be ways to add text modifiers like lists, bold, italics, etc. It is an alternative to WYSIWYG (what you see is what you get) editors, which use rich text that later gets converted to proper HTML.
It’s possible you’ve encountered Markdown without realising it. Facebook chat, Skype and Reddit all let you use different flavours of Markdown to format your messages. Here’s a quick example: to make words bold using Markdown, you simply enclose them in ** (double asterisks). So, **bold word** would render as a bold "bold word" when everything is said and done.
All told, Markdown is a great way to write for the web using plain text. It’s fast, simple to use and easy to learn. To know more about the syntax you can use this website as a reference.
But most of what you’ll need to know is that typing **word** will make it bold, typing *word* or _word_ will italicize the word, links are written like this: [anchor text](URL), and lists are written exactly how you’d expect: just hit enter and add any of these three characters at the start of each line: -, *, +.
So this:
- List item 1
- List item 2
- List item 3
Becomes this:
• List item 1
• List item 2
• List item 3
So typing Markdown is almost always faster than writing with a rich text editor, especially when you start getting into things like links or bulleted lists, which either make you use the mouse or force you to memorize a complicated sequence of keyboard shortcuts. One caveat is that if you need complicated text elements, such as tables, you’re better off sticking to HTML. Fortunately, Markdown has full HTML support, so you can code a table in HTML and go right back to Markdown in the same document.
Plus, it’s much easier to read raw Markdown than it is to read raw HTML. Which, you know, was part of the reason Markdown was even invented.
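To make the syntax rules above concrete, here is a toy sketch that turns bold, italic, link and list syntax into HTML. It is only an illustration under simplifying assumptions (no nesting, no escaping, one flat list) and is nowhere near a real Markdown parser; the function names inlineMarkdownToHtml and listToHtml are made up for this example.

```javascript
// Toy inline converter: handles only **bold**, *italic* / _italic_ and [text](url).
const inlineMarkdownToHtml = (text) =>
  text
    // Bold must run before italic, otherwise ** would be read as two single asterisks
    .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
    .replace(/\*(.+?)\*/g, '<em>$1</em>')
    // \b keeps underscores inside words and URLs (a_b) from being treated as emphasis
    .replace(/\b_(.+?)_\b/g, '<em>$1</em>')
    .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');

// Toy list converter: every line starting with "- ", "* " or "+ " becomes a list item.
const listToHtml = (markdown) => {
  const items = markdown
    .split('\n')
    .filter((line) => /^[-*+] /.test(line))
    .map((line) => `  <li>${inlineMarkdownToHtml(line.slice(2))}</li>`);
  return `<ul>\n${items.join('\n')}\n</ul>`;
};

const inlineHtml = inlineMarkdownToHtml(
  '**bold** and *italic* and _also_ [link](https://example.com/a_b)'
);
const listHtml = listToHtml('- List item 1\n- List item 2\n- List item 3');

console.log(inlineHtml);
console.log(listHtml);
```

A real converter has to deal with many more cases, which is why the rest of this post relies on existing tools rather than hand-written regular expressions.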
Markdown Converts to HTML Easily
Now, if you’re going to be writing HTML, you should just…write HTML. But if you’re, say, writing an email or a readme file where you need HTML’s formatting options but not the full breadth of its features, Markdown is perfect.
Markdown converts to HTML flawlessly, sparing you the hassle of opening and closing all those tags. So. Many. Tags. In fact, Markdown has the software to convert the plain text to HTML built-in! So Markdown is actually a text-to-HTML conversion software in addition to being a markup language. Plus, have you ever tried to convert from a .docx file to HTML? You often get so much extra formatting and spacing that it’s not worth the effort.
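Several tools can convert between HTML and Markdown, and the conversion can also be run on AWS.
Here is an example of how to use the Python markdown package in an AWS Lambda function. Note that markdown.markdown converts Markdown text to HTML: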
import markdown

def lambda_handler(event, context):
    # Convert the Markdown in event['text'] to HTML and return it
    parsedHTML = {}
    parsedHTML['text'] = markdown.markdown(event['text'])
    return parsedHTML
Outlined below is the setup for an AWS Lambda function which combines fetching the HTML for a URL, stripping it back to just the essential article content, and then converting it to Markdown. To deploy it you’ll need an AWS account, and to have the Serverless Framework installed.
There are four steps to follow:
Step 1: Download the full HTML for the URL. First, get the full HTML of the URL being converted. As this is running in a Lambda function, I decided to try out an ultra-lightweight Node HTTP client called phin.
Here is the code for step 1:
const phin = require('phin');

// Download the page and return the raw response body
const fetchPageHtml = async (fetchUrl) => {
  const response = await phin(fetchUrl);
  return response.body;
};
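Because the URL comes straight from the request, it may be worth checking it before passing it to the HTTP client. The following is a minimal sketch of that idea, not part of the original setup: the helper name parseTargetUrl is hypothetical, and it assumes only http and https URLs should be accepted.

```javascript
// Hypothetical helper: returns a normalised URL string, or null when the
// input is not an absolute http(s) URL.
const parseTargetUrl = (rawInput) => {
  let parsed;
  try {
    // The global URL class throws a TypeError for anything it cannot parse
    parsed = new URL(String(rawInput).trim());
  } catch (err) {
    return null;
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null;
  }
  return parsed.href;
};

console.log(parseTargetUrl(' https://example.com/post '));
console.log(parseTargetUrl('ftp://example.com/file'));
console.log(parseTargetUrl('not a url'));
```

A caller could return an error response when the helper yields null instead of attempting the download.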
Step 2: Convert to readable HTML. Before converting to markdown it’s a good idea to strip out the unnecessary parts of the HTML (adverts, menus, images, etc.), and just display the text of the main article in a clean and less distracting way. This process won’t work for every web page – it is designed for blog posts, news articles, etc., which have a clear “body content” section that can be the focus of the output. Mozilla has open-sourced its code for doing this in a Readability library, which can be reused here:
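const Readability = require('readability');
const { JSDOM } = require('jsdom');

const extractMainContent = (pageHtml, url) => {
  // Passing the url lets relative links in the page resolve correctly
  const doc = new JSDOM(pageHtml, { url });
  const reader = new Readability(doc.window.document);
  // Return the whole parsed article so both title and content are available
  const article = reader.parse();
  return article;
};
This returns the parsed article: its content property holds the HTML for just the article in a more readable form, and its title property holds the page title.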
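As noted above, this extraction won’t work for every page. To my knowledge, Readability’s parse() returns null when it cannot find article content, so a small guard can make that failure explicit. This is only a sketch: requireArticle is a hypothetical helper, and the objects passed to it below are stand-ins rather than real Readability output.

```javascript
// Hypothetical guard: fail with a clear message when no article was extracted.
const requireArticle = (article, url) => {
  if (!article || !article.content) {
    throw new Error(`No readable article content found at ${url}`);
  }
  return article;
};

// Stand-in for a successful parse result
const parsed = requireArticle(
  { title: 'Example title', content: '<p>Body</p>' },
  'https://example.com/post'
);
console.log(parsed.title);

// Stand-in for a failed parse
let failureMessage = null;
try {
  requireArticle(null, 'https://example.com/empty');
} catch (err) {
  failureMessage = err.message;
}
console.log(failureMessage);
```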
Step 3: Convert the HTML to Markdown. There is a CLI tool called pandoc which converts HTML to markdown. The elevator pitch for pandoc is: if you need to convert files from one markup format into another, pandoc is your swiss-army knife. To try this out locally before running it from the Lambda function, you can follow one of their installation methods, and then test it from the command line by piping an HTML file as the input:
This is the command:
cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none
The options used here are:
• -f html is the input format
• -t commonmark is the output format (a particular markdown flavour)
You can add extra configuration options to the output by adding them to the output name. commonmark-raw_html+backtick_code_blocks sets the converter to disable the raw_html extension, so no plain HTML is included in the output. It enables the backtick_code_blocks extension so that any code blocks are fenced with backticks rather than being indented.
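The way extensions are attached to the output format name can be sketched as a small helper that builds the argument list. The helper buildPandocArgs and its option names are made up for this illustration; only the resulting flags come from the command above.

```javascript
// Hypothetical helper (not part of pandoc): builds the argument list for the
// pandoc call described above. Disabled extensions get a "-" prefix and
// enabled ones a "+" prefix, all appended to the output format name.
const buildPandocArgs = ({ from, to, disable = [], enable = [], wrap = 'none' }) => {
  const extensions =
    disable.map((name) => `-${name}`).join('') +
    enable.map((name) => `+${name}`).join('');
  return ['-f', from, '-t', `${to}${extensions}`, '--wrap', wrap];
};

const pandocArgs = buildPandocArgs({
  from: 'html',
  to: 'commonmark',
  disable: ['raw_html'],
  enable: ['backtick_code_blocks'],
});

console.log(pandocArgs.join(' '));
// prints: -f html -t commonmark-raw_html+backtick_code_blocks --wrap none
```

Keeping the arguments as an array also avoids having to split a command string on spaces later on.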
The pandoc tool needs to be executed from within the node script, which involves spawning it in a child process, writing the HTML to the child stdin and then collecting the markdown output via the child stdout.
Most of these functions have been taken from this very helpful blog post on working with stdout and stdin in nodejs. First off, this is the generic streamWrite function, which allows you to pipe the HTML to the pandoc process by writing to the stdin stream of the child process.
const streamWrite = async (stream, chunk, encoding = 'utf8') =>
  new Promise((resolve, reject) => {
    // Reject if the stream errors before the write completes
    const errListener = (err) => {
      stream.removeListener('error', errListener);
      reject(err);
    };
    stream.addListener('error', errListener);
    // Resolve once the chunk has been handed to the stream
    const callback = () => {
      stream.removeListener('error', errListener);
      resolve(undefined);
    };
    stream.write(chunk, encoding, callback);
  });
This similar function reads from the stdout stream of the child process, so you can collect the markdown that is the output:
const { chunksToLinesAsync, chomp } = require('@rauschma/stringio');

const collectFromReadable = async (readable) => {
  const lines = [];
  // Read the stream line by line, stripping the trailing line break from each line
  for await (const line of chunksToLinesAsync(readable)) {
    lines.push(chomp(line));
  }
  return lines;
};
Finally, this helper function converts the callback events for the child process into an “awaitable” async function:
const onExit = async (childProcess) =>
  new Promise((resolve, reject) => {
    childProcess.once('exit', (code) => {
      if (code === 0) {
        resolve(undefined);
      } else {
        reject(new Error('Exit with error code: ' + code));
      }
    });
    childProcess.once('error', (err) => {
      reject(err);
    });
  });
To make the API a bit cleaner, here is that all wrapped up in a single helper function:
const { spawn } = require('child_process');

const spawnHelper = async (command, stdin) => {
  // Split the command string into the executable and its arguments
  const commandParts = command.split(' ');
  const childProcess = spawn(commandParts[0], commandParts.slice(1));
  await streamWrite(childProcess.stdin, stdin);
  childProcess.stdin.end();
  const outputLines = await collectFromReadable(childProcess.stdout);
  await onExit(childProcess);
  return outputLines.join('\n');
};
This makes calling pandoc from the node script much simpler:
const convertToMarkdown = async (html) => {
  const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html);
  return convertedOutput;
};
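If the asynchronous stream handling feels heavy for a function that serves one request at a time, Node’s built-in spawnSync can write the input and collect the output in a single call. The sketch below is an alternative to the helpers above, not what the original setup uses: runWithInput is a hypothetical name, it blocks the event loop while the child runs, and the output is subject to spawnSync’s maxBuffer limit. The demonstration pipes text through the Node binary itself so that it runs without pandoc installed.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: run a command, feed it `input` on stdin and return its stdout.
const runWithInput = (command, args, input) => {
  const result = spawnSync(command, args, { input, encoding: 'utf8' });
  if (result.error) {
    // The process could not be started (for example, binary not found)
    throw result.error;
  }
  if (result.status !== 0) {
    throw new Error(`Exit with error code: ${result.status}\n${result.stderr}`);
  }
  return result.stdout;
};

// Demonstration: a child Node process that copies stdin to stdout.
const echoed = runWithInput(
  process.execPath,
  ['-e', 'process.stdin.pipe(process.stdout)'],
  '<p>Hello</p>'
);
console.log(echoed);

// With pandoc available, the same helper could be called like this:
// runWithInput('/opt/bin/pandoc',
//   ['-f', 'html', '-t', 'commonmark-raw_html+backtick_code_blocks', '--wrap', 'none'],
//   html);
```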
To run this as an AWS lambda you need to include the pandoc binary. This is achieved by adding a shared lambda layer which includes a precompiled pandoc binary. You can build this yourself, or just include the public published layer in your serverless config.
# function config
layers:
  - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
Step 4: Wrap this up in the Lambda handler function. Export a function from this module which has been configured as the handler. This is the function AWS will run every time the Lambda receives a request.
module.exports.endpoint = async (event) => {
  // The request body is expected to contain just the URL to convert
  const url = event.body;
  const pageHtml = await fetchPageHtml(url);
  const article = await extractMainContent(pageHtml, url);
  const bodyMarkdown = await convertToMarkdown(article.content);
  // add the title and source url to the top of the markdown
  const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`;
  return {
    statusCode: 200,
    body: markdown,
    headers: {
      'Content-Type': 'text/markdown',
    },
  };
};
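The handler above reads event.body directly. API Gateway proxy events also carry an isBase64Encoded flag, so depending on how the endpoint is configured the body may need decoding first. The following sketch shows one way to handle that; readRequestBody and errorResponse are hypothetical helpers that do not appear in the original handler.

```javascript
// Hypothetical helper: return the request body as a trimmed string,
// decoding it first when API Gateway marks it as base64-encoded.
const readRequestBody = (event) => {
  if (!event || typeof event.body !== 'string') {
    return '';
  }
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body, 'base64').toString('utf8')
    : event.body;
  return raw.trim();
};

// Hypothetical helper: a plain-text error response in the same shape
// as the success response returned by the handler.
const errorResponse = (statusCode, message) => ({
  statusCode,
  body: message,
  headers: { 'Content-Type': 'text/plain' },
});

const plainEvent = { body: 'https://example.com/post\n', isBase64Encoded: false };
const encodedEvent = {
  body: Buffer.from('https://example.com/post').toString('base64'),
  isBase64Encoded: true,
};

console.log(readRequestBody(plainEvent));
console.log(readRequestBody(encodedEvent));
console.log(errorResponse(400, 'Request body must contain a URL'));
```

The handler could then return errorResponse(400, ...) when the body is empty rather than trying to fetch an undefined URL.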
This is the full serverless.yml configuration that is needed for serverless to deploy everything:
service: url-to-markdown
frameworkVersion: ">=1.1.0 <2.0.0"
provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1
functions:
  downloadAndConvert:
    handler: handler.endpoint
    timeout: 10
    layers:
      - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
    events:
      - http:
          path: convert
          method: post
To wrap up we can deploy and test it from the command line like below:
curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert
Those are the steps for converting HTML to Markdown with an AWS Lambda function. Remember that you need an AWS account to deploy it.