Table of contents
1. Introduction
2. Markdown always converts html Easily
3. Frequently Asked Questions
3.1. What is the significance of using Lambda over other technologies?
3.2. What is the misconception regarding the ARM 64 architecture?
3.3. Does serverless mean no servers?
3.4. What are the supported languages to run your code on AWS?
4. Conclusion
Last Updated: Mar 27, 2024

HTML to Markdown with a Server-less function


Introduction

Converting HTML to Markdown is a straightforward task. Before converting HTML to Markdown, we should first understand what Markdown is.

Markdown is a plain text formatting syntax aimed at making writing for the internet easier. The philosophy behind Markdown is that plain text documents should be readable without tags mussing everything up, but there should still be ways to add text modifiers like lists, bold, italics, etc. It is an alternative to WYSIWYG (what you see is what you get) editors, which use rich text that later gets converted to proper HTML.

It’s possible you’ve encountered Markdown without realising it. Facebook chat, Skype and Reddit all let you use different flavours of Markdown to format your messages. Here’s a quick example: to make words bold in standard Markdown, you enclose them in ** (double asterisks); some chat apps use a single asterisk instead. So, **bold word** would look like bold word when everything is said and done.

All told, Markdown is a great way to write for the web using plain text. It is simple to use and quick to learn. To learn more about the syntax, you can use this website as a reference.

But most of what you’ll need to know is that typing **word** will make it bold, typing *word* or _word_ will italicize the word, links are written like this: [anchor text](URL), and lists are written exactly how you’d expect: just hit enter and add any of these three characters at the start of each line: -, *, +.

So this:

- List item 1
- List item 2
- List item 3


Becomes this:

  • List item 1
  • List item 2
  • List item 3
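
The syntax rules above can be illustrated with a toy renderer. This is a minimal sketch, not a real Markdown parser: the function names renderInline and renderList are made up for this example, and it only covers bold, italics, links and flat lists, with no escaping or nesting.

```javascript
// Toy inline renderer: handles only **bold**, *italic* / _italic_, and [text](url).
// Real converters deal with nesting, escaping, and many more edge cases.
const renderInline = (text) =>
  text
    // **word** -> <strong>word</strong> (must run before the single-asterisk rule)
    .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
    // *word* or _word_ -> <em>word</em>
    .replace(/\*(.+?)\*/g, '<em>$1</em>')
    .replace(/_(.+?)_/g, '<em>$1</em>')
    // [anchor text](url) -> <a href="url">anchor text</a>
    .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');

// A flat list: every line starts with -, * or + followed by whitespace.
const renderList = (lines) =>
  '<ul>' +
  lines
    .map((line) => `<li>${renderInline(line.replace(/^[-*+]\s+/, ''))}</li>`)
    .join('') +
  '</ul>';
```

Calling renderList(['- List item 1', '* List item 2', '+ List item 3']) produces one <ul> with three <li> items, whichever of the three markers each line uses.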


So typing Markdown is almost always faster than writing with a rich text editor, especially when you start getting into things like links or bulleted lists, which either make you use the mouse or force you to memorize a complicated sequence of keyboard shortcuts. One caveat is that if you need complicated text elements, such as tables, you’re better off sticking to HTML. Fortunately, Markdown has full HTML support, so you can code a table in HTML and go right back to Markdown in the same document.

Plus, it’s much easier to read raw Markdown than it is to read raw HTML. Which, you know, was part of the reason Markdown was even invented.

Markdown always converts html Easily

Now, if you’re going to be writing HTML, you should just…write HTML. But if you’re, say, writing an email or a readme file where you need HTML’s formatting options but not the full breadth of its features, Markdown is perfect.

Markdown converts to HTML flawlessly, sparing you the hassle of opening and closing all those tags. So. Many. Tags. In fact, Markdown has the software to convert the plain text to HTML built-in! So Markdown is actually a text-to-HTML conversion software in addition to being a markup language. Plus, have you ever tried to convert from a .docx file to HTML? You often get so much extra formatting and spacing that it’s not worth the effort.

Several tools can convert between HTML and Markdown, and the conversion can be hosted on a cloud platform such as AWS Lambda.

Here is an example of using the Python markdown package in an AWS Lambda function. Note that markdown.markdown() converts Markdown text to HTML:

import markdown


def lambda_handler(event, context):
    # Convert the Markdown text in the event payload to HTML
    parsed_html = {}
    parsed_html["text"] = markdown.markdown(event["text"])
    return parsed_html


To convert HTML to Markdown with a serverless function, we need an AWS account.

Outlined below is the setup for an AWS lambda function which combines fetching the HTML for a URL, stripping it back to just the essential article content, and then converting it to Markdown. To deploy it you’ll need an AWS account, and to have the serverless framework installed.

There are four important steps to follow:

Step 1: Download the full HTML for the URL. First, get the full HTML of the URL being converted. As this runs in a Lambda function, I decided to try out an ultra-lightweight Node HTTP client called phin.

Here is the code for step 1:
 

const phin = require('phin');

// Download the page at fetchUrl and return the response body
const fetchPageHtml = async (fetchUrl) => {
  const response = await phin(fetchUrl);
  return response.body;
};
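
Because the URL comes from the caller, it may be worth checking it before fetching. The following is a minimal sketch, assuming only http and https pages should be downloaded; normalizeTargetUrl is a hypothetical helper name, not part of phin or of the original code.

```javascript
// Hypothetical helper: validate and normalise the URL before downloading it.
const normalizeTargetUrl = (raw) => {
  // The WHATWG URL constructor throws a TypeError on malformed input
  const parsed = new URL(String(raw).trim());
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
};
```

The result could then be passed to the fetch function from step 1 in place of the raw string.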


Step 2: Convert to readable HTML. Before converting to Markdown, it’s a good idea to strip out the unnecessary parts of the HTML (adverts, menus, images, etc.) and just display the text of the main article in a clean and less distracting way. This process won’t work for every web page; it is designed for blog posts, news articles, etc. which have a clear “body content” section that can be the focus of the output. Mozilla has open-sourced its code for doing this in a Readability library, which can be reused here:
 

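// Mozilla's Readability library. The npm package @mozilla/readability exposes it as:
// const { Readability } = require("@mozilla/readability");
const Readability = require("readability");
const JSDOM = require("jsdom").JSDOM;

const extractMainContent = (pageHtml, url) => {
    const doc = new JSDOM(pageHtml, {
      url,
    });
    const reader = new Readability(doc.window.document);
    // parse() returns an object with the article title, content (HTML) and other fields
    const article = reader.parse();
    return article;
};


This returns the parsed article: article.content holds the HTML for just the article in a more readable form, and article.title holds its title.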

Step 3: Convert to Markdown. There is a CLI tool called pandoc which converts HTML to Markdown. The elevator pitch for pandoc is: if you need to convert files from one markup format into another, pandoc is your swiss-army knife. To try this out locally before running it from the Lambda function, you can follow one of their installation methods, and then test it from the command line by piping an HTML file as the input:

This is the command:

cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none


The options used here are:

• -f html is the input format
• -t commonmark is the output format (a particular Markdown flavour)
• --wrap none stops pandoc from wrapping lines in the output

You can add extra configuration options to the output by adding them to the output name. commonmark-raw_html+backtick_code_blocks sets the converter to disable the raw_html extension, so no plain HTML is included in the output. It enables the backtick_code_blocks extension so that any code blocks are fenced with backticks rather than being indented.
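
The way extensions attach to the output format name can be sketched as a small helper that assembles the argument list. This is only an illustration: buildPandocArgs and its option names (enable, disable, wrap) are hypothetical, and the helper does not check whether pandoc supports a given extension for a given format.

```javascript
// Hypothetical helper: build pandoc arguments from a format plus extension toggles.
// Disabled extensions are appended as -name, enabled ones as +name.
const buildPandocArgs = ({
  from = 'html',
  to = 'commonmark',
  enable = [],
  disable = [],
  wrap = 'none',
} = {}) => {
  const target =
    to +
    disable.map((ext) => `-${ext}`).join('') +
    enable.map((ext) => `+${ext}`).join('');
  return ['-f', from, '-t', target, `--wrap=${wrap}`];
};
```

With disable: ['raw_html'] and enable: ['backtick_code_blocks'], the target becomes commonmark-raw_html+backtick_code_blocks, matching the command shown earlier.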

The pandoc tool needs to be executed from within the Node script, which involves spawning it in a child process, writing the HTML to the child’s stdin, and then collecting the Markdown output from the child’s stdout.

Most of these functions have been taken from this very helpful blog post on working with stdout and stdin in Node.js. First off, this is the generic streamWrite function, which allows you to pipe the HTML to the pandoc process by writing to the stdin stream of the child process.
 

// Write a chunk to a writable stream and resolve once it has been flushed
const streamWrite = async (stream, chunk, encoding = 'utf8') =>
  new Promise((resolve, reject) => {
    const errListener = (err) => {
      stream.removeListener('error', errListener);
      reject(err);
    };
    stream.addListener('error', errListener);
    const callback = () => {
      stream.removeListener('error', errListener);
      resolve(undefined);
    };
    stream.write(chunk, encoding, callback);
  });


This similar function reads from the stdout stream of the child process, so you can collect the markdown that is the output:
 

const {
  chunksToLinesAsync,
  chomp
} = require('@rauschma/stringio');

// Read the stream line by line and return the lines without trailing newlines
const collectFromReadable = async (readable) => {
  let lines = [];
  for await (const line of chunksToLinesAsync(readable)) {
    lines.push(chomp(line));
  }
  return lines;
};
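
If you would rather avoid the extra dependency, Node readable streams are async iterable, so the output can also be collected with built-ins only. This is a sketch of an alternative rather than the approach used above; collectText is a hypothetical name, and it returns the whole output as one string instead of an array of lines.

```javascript
// Alternative sketch using only Node built-ins: collect a readable stream as one string.
const collectText = async (readable) => {
  // Decode to UTF-8 strings; the decoder handles multi-byte characters split across chunks
  readable.setEncoding('utf8');
  let text = '';
  for await (const chunk of readable) {
    text += chunk;
  }
  return text;
};
```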


Finally, this helper function converts the callback events for the child process into an “awaitable” async function:
 

// Resolve when the child exits with code 0, reject on a non-zero code or spawn error
const onExit = async (childProcess) =>
  new Promise((resolve, reject) => {
    childProcess.once('exit', (code) => {
      if (code === 0) {
        resolve(undefined);
      } else {
        reject(new Error('Exit with error code: ' + code));
      }
    });
    childProcess.once('error', (err) => {
      reject(err);
    });
  });
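
Node’s events module has a promise-based once() helper that can express the same idea more compactly. The following is a sketch of that alternative, not the code used in the rest of this article; waitForExit is a hypothetical name.

```javascript
const { once } = require('events');

// Alternative sketch: wait for the 'exit' event using events.once().
// once() rejects if the emitter emits 'error' while waiting (e.g. the binary is missing).
const waitForExit = async (childProcess) => {
  const [code, signal] = await once(childProcess, 'exit');
  if (code !== 0) {
    throw new Error(`Exit with error code: ${code} (signal: ${signal})`);
  }
};
```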


To make the API a bit cleaner, here it is all wrapped up in a single helper function:
 

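const { spawn } = require('child_process');

const spawnHelper = async (command, stdin) => {
  // Split the command line into the executable and its arguments
  const commandParts = command.split(' ');
  const childProcess = spawn(commandParts[0], commandParts.slice(1));
  // Send the input, then close stdin so the child knows the input is complete
  await streamWrite(childProcess.stdin, stdin);
  childProcess.stdin.end();
  const outputLines = await collectFromReadable(childProcess.stdout);
  await onExit(childProcess);
  return outputLines.join('\n');
};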
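
For comparison, the same write-input, read-output, check-exit-code sequence can be done synchronously with child_process.spawnSync, which accepts the stdin content through its input option. This is a simpler sketch under the assumption that blocking the event loop for the duration of the conversion is acceptable; runWithInput is a hypothetical name.

```javascript
const { spawnSync } = require('child_process');

// Sketch: run a command synchronously, feeding `input` to its stdin.
const runWithInput = (command, args, input) => {
  const result = spawnSync(command, args, { input, encoding: 'utf8' });
  if (result.error) {
    throw result.error; // e.g. the binary could not be found
  }
  if (result.status !== 0) {
    throw new Error(`Exit with error code: ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
};
```

Very large outputs may need a bigger maxBuffer option, since spawnSync buffers the whole result in memory.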


This makes calling pandoc from the node script much simpler:
 

const convertToMarkdown = async (html) => {
  // /opt/bin/pandoc is where the Lambda layer places the pandoc binary
  const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html);
  return convertedOutput;
};



To run this as an AWS lambda you need to include the pandoc binary. This is achieved by adding a shared lambda layer which includes a precompiled pandoc binary. You can build this yourself, or just include the public published layer in your serverless config.

# function config
layers:
  - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1


Step 4: Wrap this up in the Lambda handler function. Export a function from this module which has been configured as the handler. This is the function AWS will run every time the Lambda receives a request.
 

module.exports.endpoint = async (event) => {
  // The request body is expected to contain just the URL to convert
  const url = event.body;
  const pageHtml = await fetchPageHtml(url);
  const article = await extractMainContent(pageHtml, url);
  const bodyMarkdown = await convertToMarkdown(article.content);
  // add the title and source url to the top of the markdown
  const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`;
  return {
    statusCode: 200,
    body: markdown,
    headers: {
      'Content-Type': 'text/markdown'
    }
  };
};
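
The handler above lets any failure (a bad URL, a page that cannot be parsed, a pandoc error) propagate as an unhandled exception. One possible way to return a readable error instead is to wrap the handler. This is a minimal sketch; withErrorHandling is a hypothetical name, and the choice of a 500 status with a plain-text body is an assumption rather than something the original setup prescribes.

```javascript
// Hypothetical wrapper: turn thrown errors into an HTTP-style error response.
const withErrorHandling = (handler) => async (event) => {
  try {
    return await handler(event);
  } catch (err) {
    console.error(err); // keep the details in the function logs
    return {
      statusCode: 500,
      body: `Conversion failed: ${err.message}`,
      headers: { 'Content-Type': 'text/plain' },
    };
  }
};
```

The exported handler could then be written as withErrorHandling(async (event) => { ... }) around the existing body.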


This is the full serverless.yml configuration that is needed for serverless to deploy everything:

service: url-to-markdown

frameworkVersion: ">=1.1.0 <2.0.0"

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1

functions:
  downloadAndConvert:
    handler: handler.endpoint
    timeout: 10
    layers:
      - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
    events:
      - http:
          path: convert
          method: post


To wrap up we can deploy and test it from the command line like below:

curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert

These are the steps to set up HTML to Markdown conversion on AWS Lambda. Make sure you have an AWS account before you start.


Frequently Asked Questions

What is the significance of using Lambda over other technologies?

With Lambda, you're charged based on the number of requests for your functions as well as the duration (the amount of time it takes for your code to execute) down to the millisecond.
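
As a rough illustration of that pricing model, the sketch below adds a per-request charge to a duration charge. The duration charge on Lambda is also scaled by the memory configured for the function, so memory appears as an input. The function name and parameter names are hypothetical, prices are passed in rather than hard-coded, and free-tier allowances are ignored.

```javascript
// Rough Lambda cost estimate: request charge plus duration charge.
// All prices are inputs; look up the current rates for your region and architecture.
const estimateLambdaCost = ({
  requests,
  avgDurationMs,
  memoryMb,
  pricePerRequest,
  pricePerGbSecond,
}) => {
  const requestCost = requests * pricePerRequest;
  // Duration is billed in GB-seconds: seconds of execution times configured memory in GB
  const gbSeconds = requests * (avgDurationMs / 1000) * (memoryMb / 1024);
  return requestCost + gbSeconds * pricePerGbSecond;
};
```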

What is the misconception regarding the ARM 64 architecture?

In general, ARM is less expensive per unit of work done on AWS. But don't make the mistake of thinking that ARM is the quick little brother who will launch your lambdas while x86 is still booting up. It will not significantly speed up the launch of your virtualized environment. It will simply perform more work for less money.

Does serverless mean no servers? 

Serverless does not imply that servers are no longer required; it simply means that they are not defined or controlled by the user.

What are the supported languages to run your code on AWS?

  • C# programming language and .NET framework
  • C++ programming language SDK
  • Go programming language SDK
  • Java programming language
  • JavaScript programming language
  • PHP programming language
  • Kotlin programming language
  • Python programming language
  • Ruby programming language
  • Swift programming language

Conclusion

To conclude, we’ve looked at converting HTML to Markdown with a serverless function, covering the AWS Lambda setup, the pandoc layer, and the Serverless Framework configuration. Finally, we discussed some frequently asked questions.

We hope this article has helped you. To keep learning, have a look at more related articles: Amazon API Gateway, Amazon Personalize, Amazon Lex, and many more.

Refer to our carefully curated articles and videos and code studio library if you want to learn more. Refer to our guided paths on Coding Ninjas Studio to learn more about DSA, Competitive Programming, JavaScript, System Design, etc. Enrol in our courses and refer to the mock test and problems available. Take a look at the interview experiences and interview bundle for placement preparations.
